HuggingFace benchmarks code-switched ASR: OpenAI, Google, Meta fail hard

HuggingFace

June 10, 2026

Original source: huggingface.co

HuggingFace's New Code-Switched Benchmark

HuggingFace has released a benchmark built specifically for code-switched speech, where speakers mix two languages mid-sentence, as in "I need to go to the mercado, then we'll see." It covers five ASR systems: OpenAI's Whisper (large-v3), Google's Chirp, Meta's MMS, and two further open-source models. The dataset spans 10 language pairs, including Hindi-English, Spanish-English, and Mandarin-English, with more than 100 hours of real-world bilingual conversation. The results are sobering. The best model, Whisper large-v3, reached a word error rate (WER) of 22.3% on code-switched segments, against 8.1% on monolingual speech. That is roughly 2.75 times the error rate. Google's Chirp was close behind at 24.7%. For context, human transcribers reach around 5% WER on the same data. The benchmark is fully open: code, data splits, and evaluation scripts are on GitHub, with no paywall and no API key required.
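For readers unfamiliar with the metric, here is a minimal sketch of how word error rate is usually computed: the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. This is not the benchmark's own evaluation script. The function name, the lowercase-and-split normalization, and the mistranscribed word "mercator" are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# The article's example sentence, punctuation removed by hand.
reference = "i need to go to the mercado then we'll see"
# Hypothetical ASR output: the Spanish word is replaced by an English one.
hypothesis = "i need to go to the mercator then we'll see"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")

# Ratio of the two reported error rates (22.3% vs 8.1%).
ratio = 22.3 / 8.1
print(f"code-switched / monolingual: {ratio:.2f}x")
```

One substituted word in a ten-word reference gives a WER of 10%. Real evaluation pipelines typically apply heavier text normalization (punctuation, numerals, casing) before scoring, which can shift the numbers noticeably.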

Why Code-Switching Stumps Current ASR

The problem isn't new, but it has been largely ignored. Most ASR datasets are monolingual: LibriSpeech is English only, and Common Voice is organized into per-language silos. Models learn to expect one language at a time, and code-switching breaks that assumption. When a speaker says "Let's check the presupuesto before we approve anything," the model has to change languages mid-stream, and most fail. The root issue is training data: even a large model like Whisper (about 1.5 billion parameters) was trained on a corpus that is 65% English, with code-switched utterances making up less than 0.5%. That skew is not a bug in any one model; it is a by-product of how speech data is collected. But the real world doesn't match those proportions. In Mumbai, Singapore, Barcelona, and Nairobi, code-switching is the norm, not the exception. HuggingFace's benchmark quantifies just how bad the gap is, and the numbers are ugly.

The Real Cost for Voice Assistants and Bilingual Users

This isn't an academic nitpick. Voice assistants, from Siri to customer service bots, are deployed globally, but they are effectively monolingual at the sentence level. If you're a bilingual speaker in Texas or a call center agent in Manila, you're paying the price. A 22% word error rate means roughly one word in every four or five is wrong. That's not a minor annoyance; it's a usability disaster. Imagine ordering food through a voice bot, saying "I want two tacos de carnitas and a horchata," and the bot hears "I want two tacos deer carnitas and a orchid." Two wrong words out of nine is about 22%, the kind of error the benchmark's numbers imply. For enterprises running multilingual customer service, this means higher escalation rates, longer call times, and frustrated users. The benchmark suggests that current ASR simply isn't ready for the bilingual world. And since most companies rely on these APIs, they're building on a shaky foundation.
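To get a rough feel for what these error rates mean per utterance, the sketch below estimates how often a short order would come through with no errors at all. It rests on a strong simplifying assumption that is not from the benchmark: each word is treated as failing independently with probability equal to the reported WER. Real errors cluster around switch points, and WER also counts inserted words, so read the output as a back-of-the-envelope illustration. The function name is hypothetical.

```python
def clean_utterance_probability(wer: float, n_words: int) -> float:
    """Chance that every one of n_words is transcribed correctly.

    Assumption: each word fails independently with probability `wer`.
    """
    if not 0.0 <= wer <= 1.0:
        raise ValueError("wer must be between 0 and 1 for this estimate")
    return (1.0 - wer) ** n_words


# The food order quoted in the article.
order = "I want two tacos de carnitas and a horchata"
n_words = len(order.split())

# WER figures reported for the best model in the benchmark.
reported = [("code-switched", 0.223), ("monolingual", 0.081)]

for label, wer in reported:
    p_clean = clean_utterance_probability(wer, n_words)
    print(f"{label}: about {p_clean:.0%} of {n_words}-word utterances error-free")
```

Under this assumption, only about 10% of nine-word code-switched utterances would be transcribed without a single error, compared with about 47% for monolingual speech.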

What We Don't Know: Training Data and Model Sensitivity

The benchmark raises more questions than it answers. First, the dataset itself: HuggingFace collected audio from public sources like YouTube and podcasts, but there's no breakdown of speaker demographics, accent variation, or recording quality. Does performance drop more for certain dialects? We don't know. Second, the commercial systems are black boxes, and even for the open-weight models the reported results don't show where language switching breaks down. Is Whisper simply confused, or is its language identification failing at the switch point? Third, there's no ablation study: does fine-tuning on 10 hours of code-switched data close the gap, or is the architecture fundamentally wrong? HuggingFace says it will release a follow-up on fine-tuning, but it's not here yet. Finally, the benchmark only covers 10 language pairs. What about trilingual speakers, or languages with no written script? Those are still invisible. The short version: we now have a clear problem statement, but no clear solution.

Frequently Asked Questions

What is code-switching in speech recognition?

Code-switching is when a speaker alternates between two or more languages within a single sentence or conversation, as in "I need to call the banco about my account." It's common in bilingual communities, but most ASR systems are trained on monolingual data, so they perform poorly on these mixed-language utterances.

Which ASR models did HuggingFace test?

They tested five models: OpenAI's Whisper large-v3, Google's Chirp, Meta's MMS, and two open-source models (Whisper small and a fine-tuned variant). Whisper large-v3 performed best overall but still had a 22.3% word error rate on code-switched segments, compared to 8.1% on monolingual speech.

Why is code-switching hard for current ASR systems?

Current ASR models are trained on datasets that are overwhelmingly monolingual — Whisper's training data, for example, is 65% English with less than 0.5% code-switched utterances. The models learn to expect one language at a time, so they fail when a speaker switches mid-sentence.

How does this affect real-world voice applications?

Voice assistants, customer service bots, and transcription services deployed globally often serve bilingual users. A 22% word error rate means roughly one word in every four or five is wrong, leading to wrong orders, frustration, and higher escalation rates. For enterprises, this translates to longer call times and lower user satisfaction.

What are the next steps for improving code-switched ASR?

HuggingFace plans to release a follow-up on fine-tuning, but key questions remain: does more code-switched training data help, or is a new architecture needed? The benchmark is open-source, so researchers can experiment. However, the dataset's limited language pairs and lack of demographic detail leave many gaps unaddressed.