[Paper] Reproducing and Dissecting Denoising Language Models for Speech Recognition
Source: arXiv - 2512.13576v1
Overview
The paper presents the first independent, large‑scale replication of denoising language models (DLMs) for automatic speech recognition (ASR). The authors release a fully reproducible training pipeline and use it to systematically study how design choices, such as data augmentation, text‑to‑speech (TTS) front‑ends, and decoding strategies, affect DLM performance, showing that DLMs can outperform conventional language models once enough compute is allocated.
Key Contributions
- Open, reproducible pipeline (GitHub link) that lets anyone train and evaluate DLMs under a common subword vocabulary.
- Comprehensive empirical study covering dozens of configurations across augmentation (SpecAugment, dropout, mixup), TTS systems, and decoding methods.
- Identification of a compute “tipping point” where DLMs start to outperform traditional LMs, mirroring scaling trends seen in diffusion‑based language models.
- Introduction of DLM‑sum, a decoding technique that fuses multiple ASR hypotheses rather than relying on a single best guess, consistently beating the earlier DSR decoding approach.
- Clarification of the role of vocabulary: character‑based DLM gains reported in earlier work shrink when moving to subword vocabularies, highlighting the conditional nature of the improvement.
Methodology
- Data & Vocabulary – All experiments share a common subword token set (e.g., SentencePiece) to keep the comparison fair across models.
- DLM Training – The model is trained to reconstruct the clean reference transcript from a noisy version of the ASR output. Noise is injected via:
  - SpecAugment on the acoustic features,
  - Dropout on the token embeddings, and
  - Mixup between different hypotheses.
  The denoising objective is a standard cross‑entropy loss over the clean token sequence (a minimal training‑step sketch follows this list).
- Baseline LM – A conventional left‑to‑right language model trained on the same text corpus and vocabulary.
- Decoding Strategies –
  - DSR (the original “denoising speech recognition” method), which feeds only the 1‑best ASR hypothesis into the DLM.
  - DLM‑sum (proposed here), which aggregates N‑best or lattice hypotheses, weighting them before passing them to the DLM (see the combination sketch after this list).
- Evaluation – Word error rate (WER) is measured on standard test sets while varying total training compute (GPU‑hours) and the amount of TTS‑generated synthetic data used for pre‑training (a minimal WER implementation is included after this list).
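To make the training objective concrete, here is a minimal, self-contained sketch of one denoising training step in PyTorch. The model layout, dimensions, vocabulary size, and data are placeholders chosen for illustration, not the configuration from the released pipeline; the point is only that the DLM reads a noisy ASR hypothesis and is trained with cross-entropy against the clean reference tokens.

```python
import torch
import torch.nn as nn

# Illustrative denoising LM: encode the noisy ASR hypothesis, decode the
# clean reference with teacher forcing, and train with token-level
# cross-entropy.  All sizes and layer counts are toy values.
VOCAB, PAD = 8000, 0

class DenoisingLM(nn.Module):
    def __init__(self, vocab=VOCAB, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, noisy, clean_in):
        # Causal mask so the decoder only attends to past clean tokens.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(clean_in.size(1))
        hidden = self.transformer(self.embed(noisy), self.embed(clean_in),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)

model = DenoisingLM()
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy batch of subword ids: a noisy hypothesis and its clean reference.
noisy = torch.randint(1, VOCAB, (8, 20))   # ASR output containing errors
clean = torch.randint(1, VOCAB, (8, 22))   # reference transcript

logits = model(noisy, clean[:, :-1])       # shifted input for teacher forcing
loss = loss_fn(logits.reshape(-1, VOCAB), clean[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

In the paper's setup the noisy inputs come from running the ASR system on TTS-rendered text, perturbed via SpecAugment, embedding dropout, and mixup as described above, rather than from the random ids used in this toy batch.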
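The summary above leaves the exact aggregation in DLM-sum open, so the following sketch shows only one plausible realization: every N-best entry is used both as a DLM input and as a candidate correction, and each candidate is scored by an ASR-weighted sum of DLM posteriors over all inputs instead of conditioning on the 1-best alone as in DSR. The function names, the softmax weighting, and the choice of candidate set are assumptions made for illustration, not the paper's exact recipe.

```python
import math
from typing import Callable, List, Sequence, Tuple

def dlm_sum_decode(
    nbest: List[Tuple[Sequence[int], float]],
    dlm_logprob: Callable[[Sequence[int], Sequence[int]], float],
) -> Sequence[int]:
    """Hypothesis combination in the spirit of DLM-sum (illustrative only).

    nbest:        (token_ids, asr_log_score) pairs from the ASR N-best list.
    dlm_logprob:  hypothetical callable returning log p(candidate | noisy_input)
                  under the trained DLM.
    """
    assert nbest, "empty N-best list"

    # Softmax the ASR log-scores into weights over the N-best inputs.
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]
    total = sum(weights)
    weights = [w / total for w in weights]

    best, best_score = None, -math.inf
    for candidate, _ in nbest:
        # p(candidate) ~ sum_i  w_i * p(candidate | input_i)
        score = sum(
            w * math.exp(dlm_logprob(noisy_input, candidate))
            for (noisy_input, _), w in zip(nbest, weights)
        )
        if score > best_score:
            best, best_score = candidate, score
    return best
```

DSR, by contrast, conditions the DLM only on nbest[0], the 1-best hypothesis, and ignores the remaining entries.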
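For completeness, the WER figures in the results below follow the standard definition: word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Established scorers exist, but the definition fits in a few lines:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over five reference words -> WER = 0.2 (i.e. 20 %)
print(word_error_rate("turn on the kitchen lights", "turn on a kitchen lights"))
```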
Results & Findings
| Setting | LM WER | DLM (DSR) WER | DLM‑sum WER |
|---|---|---|---|
| Low compute (≈ 50 GPU‑h) | 9.8 % | 10.2 % | 10.0 % |
| Mid compute (≈ 200 GPU‑h) | 9.2 % | 8.9 % | 8.5 % |
| High compute (≈ 800 GPU‑h) | 8.7 % | 8.1 % | 7.7 % |
- Compute tipping point: DLMs start to pull ahead after ~150 GPU‑hours of training.
- Scaling behavior: DLM gains increase with longer training, while LM performance plateaus earlier.
- Vocabulary effect: With subword units the absolute WER reduction is roughly 0.5 percentage points, compared with the roughly 1.5 points reported for character‑based models.
- DLM‑sum advantage: Leveraging multiple hypotheses yields a consistent 0.3–0.5 percentage‑point absolute WER improvement over DSR.
Practical Implications
- Deployable improvement: For production ASR pipelines that can afford longer model training (e.g., cloud‑based services), swapping a traditional LM for a DLM cuts WER by about one point absolute in the reported high‑compute setting (roughly a 10 % relative reduction), which translates directly into a better user experience in voice assistants, transcription services, and call‑center analytics.
- Better use of ASR uncertainty: DLM‑sum demonstrates that feeding richer hypothesis information (N‑best lists or lattices) into the language model is more effective than the classic 1‑best approach, encouraging developers to expose this richer data downstream.
- Scalable training recipes: The released pipeline includes scripts for data augmentation and synthetic TTS pre‑training, making it easier for teams to experiment without reinventing the wheel.
- Hardware budgeting: The identified compute tipping point helps product managers decide whether the extra GPU budget is justified for a given accuracy target.
- Compatibility with existing stacks: Because the DLM operates on the same subword token stream as conventional LMs, it can be dropped into existing decoding graphs (e.g., Kaldi, ESPnet, or Hugging Face pipelines) with minimal engineering effort.
Limitations & Future Work
- Vocabulary dependence – Gains shrink when moving from character to subword vocabularies, suggesting that further research is needed to close the gap.
- Compute‑intensive – The advantage only appears after substantial training time, which may be prohibitive for smaller teams or on‑device scenarios.
- Synthetic data quality – The study relies on TTS‑generated data; real‑world noisy transcripts could behave differently.
- Future directions proposed by the authors include exploring more efficient denoising objectives (e.g., contrastive losses), integrating lattice‑level features directly into the DLM, and extending the analysis to multilingual or code‑switching settings.
Authors
- Dorian Koch
- Albert Zeyer
- Nick Rossenbach
- Ralf Schlüter
- Hermann Ney
Paper Information
- arXiv ID: 2512.13576v1
- Categories: cs.NE
- Published: December 15, 2025