[Paper] 'Sorry, I Didn't Catch That': How Speech Models Miss What Matters Most

Published: February 12, 2026
4 min read
Source: arXiv


Overview

Even though modern speech‑to‑text services boast impressively low word‑error rates on standard tests, they can still stumble on the short, mission‑critical phrases that matter most in everyday life. This paper investigates that gap by measuring how well 15 commercial speech models transcribe U.S. street names spoken by a linguistically diverse set of users. The findings reveal a startling 44 % average error rate and expose disproportionate harms for speakers whose primary language isn’t English.

Key Contributions

  • Large‑scale real‑world benchmark: Collected and annotated a dataset of street‑name utterances from speakers across multiple language backgrounds in the United States.
  • Comprehensive model audit: Evaluated 15 state‑of‑the‑art speech‑to‑text APIs (from providers such as OpenAI, Deepgram, Google, and Microsoft) on the same data, quantifying error patterns.
  • Impact analysis: Mapped transcription mistakes to geographic routing errors, showing that non‑English primary speakers suffer twice the distance error of native English speakers.
  • Synthetic data augmentation pipeline: Developed a low‑cost method that uses open‑source text‑to‑speech (TTS) to generate diverse pronunciations of street names.
  • Effective fine‑tuning: Demonstrated that adding fewer than 1 000 synthetic examples improves transcription accuracy for the hardest demographic by ~60 % (relative gain).

Methodology

  1. Data collection – Recruited a balanced cohort of U.S. participants (English‑first and non‑English‑first speakers) and asked them to read a list of real street names. Recordings were captured in typical indoor/outdoor acoustic conditions.
  2. Ground‑truth labeling – Each audio clip was manually transcribed by linguists to create a gold standard.
  3. Model evaluation – Sent the same audio to 15 commercial speech‑recognition APIs. Transcriptions were compared to the gold standard using word‑error rate (WER) and a custom “street‑name exact‑match” metric.
  4. Downstream impact simulation – Fed mis‑transcribed street names into a routing engine to compute the extra travel distance caused by the error.
  5. Synthetic augmentation – Using open‑source TTS models (e.g., Coqui TTS, Mozilla TTS), generated multiple pronunciations for each street name, varying speaker accent, speaking rate, and background noise.
  6. Fine‑tuning – Fine‑tuned each commercial model via its public fine‑tuning endpoint (or an open‑source replica) with ≤1 000 synthetic samples, then re‑evaluated on the original test set.
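The two evaluation metrics from step 3 can be sketched in a few lines. This is a minimal illustration, not the paper's actual scoring code: `wer` is the standard word‑level edit distance normalized by reference length, and `street_name_exact_match` is an all‑or‑nothing check after light normalization (both function names are ours).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-error rate: word-level edit distance over reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def street_name_exact_match(reference: str, hypothesis: str) -> bool:
    """All-or-nothing street-name match after light normalization."""
    norm = lambda s: " ".join(s.lower().replace(".", "").split())
    return norm(reference) == norm(hypothesis)

print(wer("Martin Luther King Boulevard", "Martin Luther King Boulvard"))  # 0.25
print(street_name_exact_match("Elm St.", "elm st"))  # True
```

Note how the exact‑match metric is stricter than WER: a single wrong word in a four‑word street name yields a tolerable 25 % WER but a total exact‑match failure, which is exactly the gap the paper highlights.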

Results & Findings

| Metric | Baseline (average across 15 models) | After synthetic fine‑tuning (non‑English speakers) |
| --- | --- | --- |
| Word‑Error Rate (WER) | 44 % | 27 % (≈ 38 % relative reduction) |
| Exact‑match street‑name accuracy | 31 % | 49 % (≈ 60 % relative gain) |
| Average routing distance error | 2.3 km | 1.1 km (≈ 52 % reduction) |
  • Errors were systematic: most models missed the same phonetic cues (e.g., “Boulevard” vs. “Boulvard”).
  • Non‑English primary speakers incurred twice the extra travel distance compared with English‑first speakers.
  • The synthetic augmentation required minimal compute (a few GPU hours) and no real human recordings, yet delivered the largest gains for the hardest‑hit demographic.
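To give a feel for how a routing‑distance error might be quantified (step 4 of the methodology), here is a rough sketch using great‑circle distance between the intended destination and the one a geocoder would return for a mis‑transcribed name. The coordinates are made‑up examples, and the paper's routing engine would report road distances, which are generally longer than the straight‑line figure below.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical intended vs. mis-routed destinations in San Francisco:
intended = (37.7749, -122.4194)
mis_routed = (37.7858, -122.4064)
print(f"extra distance ≈ {haversine_km(*intended, *mis_routed):.2f} km")
```

Averaging such per‑utterance distance errors over a test set, split by speaker group, is what surfaces the two‑to‑one disparity reported above.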

Practical Implications

  • Product teams building navigation, emergency‑response, or delivery apps should not rely solely on benchmark WER; they need targeted validation on short, high‑stakes utterances.
  • Model providers can improve fairness by incorporating synthetic, accent‑rich data for named entities—especially place names that appear in critical workflows.
  • The augmentation pipeline is plug‑and‑play: developers can generate a few thousand TTS samples for any domain‑specific vocabulary (e.g., medical terms, legal jargon) and fine‑tune existing APIs, dramatically lowering error rates without costly data‑collection campaigns.
  • Regulatory and safety considerations: mis‑routing caused by transcription errors could have legal ramifications for autonomous‑vehicle fleets or emergency‑dispatch systems; the paper’s methodology offers a concrete mitigation path.
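The augmentation loop itself is simple enough to sketch. The snippet below is an assumption‑laden stand‑in: `placeholder_tts` substitutes a dummy waveform for a real TTS call (a real pipeline would invoke an engine such as Coqui TTS here), and `augment` varies speaking rate by resampling and mixes in white noise at a target SNR, mirroring the rate and noise perturbations the paper describes.

```python
import numpy as np

def placeholder_tts(text: str, sr: int = 16000, seconds: float = 1.0) -> np.ndarray:
    """Stand-in for a real TTS call; returns a deterministic dummy waveform."""
    t = np.linspace(0, seconds, int(sr * seconds), endpoint=False)
    return 0.1 * np.sin(2 * np.pi * 220.0 * t)

def augment(wave: np.ndarray, rate: float, snr_db: float, seed: int = 0) -> np.ndarray:
    """Change speaking rate by resampling, then add white noise at snr_db."""
    idx = np.arange(0, len(wave), rate)  # rate > 1 speeds up, < 1 slows down
    stretched = np.interp(idx, np.arange(len(wave)), wave)
    rng = np.random.default_rng(seed)
    signal_power = np.mean(stretched ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), len(stretched))
    return stretched + noise

base = placeholder_tts("Mockingbird Lane")
variants = [augment(base, rate=r, snr_db=snr, seed=i)
            for i, (r, snr) in enumerate([(0.9, 20), (1.0, 10), (1.2, 15)])]
print(len(variants), [len(v) for v in variants])
```

Run over a vocabulary list of street names (or any domain‑specific terms), a loop like this yields the small, cheap fine‑tuning set the paper found sufficient.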

Limitations & Future Work

  • The study focuses on U.S. street names; results may differ for other toponymic systems (e.g., non‑Latin scripts, rural address conventions).
  • Synthetic TTS voices, while diverse, may still miss subtle sociolinguistic nuances present in real speakers (e.g., code‑switching, regional slang).
  • Fine‑tuning was performed on a limited subset of commercial models; broader access to model internals could yield even larger improvements.
  • Future research could explore active learning loops where real user corrections continuously enrich the synthetic dataset, and extend the approach to multilingual or code‑mixed utterances.

Authors

  • Kaitlyn Zhou
  • Martijn Bartelds
  • Federico Bianchi
  • James Zou

Paper Information

  • arXiv ID: 2602.12249v1
  • Categories: cs.AI, cs.CL, cs.CY
  • Published: February 12, 2026
