[Paper] Reproducing and Dissecting Denoising Language Models for Speech Recognition
Source: arXiv - 2512.13576v1
Overview
The paper presents the first independent, large‑scale replication of denoising language models (DLMs) for automatic speech recognition (ASR). The authors release a fully reproducible training pipeline and use it to systematically study how design choices, such as data augmentation, text‑to‑speech (TTS) front‑ends, and decoding strategies, affect DLM performance, showing that DLMs can outperform conventional language models once enough compute is allocated.
Key Contributions
- Open, reproducible pipeline (GitHub link) that lets anyone train and evaluate DLMs under a common subword vocabulary.
- Comprehensive empirical study covering dozens of configurations across augmentation (SpecAugment, dropout, mixup), TTS systems, and decoding methods.
- Identification of a compute “tipping point” where DLMs start to outperform traditional LMs, mirroring scaling trends seen in diffusion‑based language models.
- Introduction of DLM‑sum, a decoding technique that fuses multiple ASR hypotheses rather than relying on a single best guess, consistently beating the earlier DSR decoding approach.
- Clarification of the role of vocabulary: character‑based DLM gains reported in earlier work shrink when moving to subword vocabularies, highlighting the conditional nature of the improvement.
Methodology
- Data & Vocabulary – All experiments share a common subword token set (e.g., SentencePiece) to keep the comparison fair across models.
- DLM Training – The model is trained to reconstruct the clean reference transcript from a noisy version of the ASR output. Noise is injected via:
  - SpecAugment on the acoustic features,
  - Dropout on the token embeddings, and
  - Mixup between different hypotheses.
  The denoising objective is a standard cross‑entropy loss over the clean token sequence (a minimal training‑step sketch follows this list).
- Baseline LM – A conventional left‑to‑right language model trained on the same text corpus and vocabulary.
- Decoding Strategies –
  - DSR (the original “denoising speech recognition” method), which feeds only the 1‑best ASR hypothesis into the DLM.
  - DLM‑sum (proposed here), which aggregates N‑best or lattice hypotheses, weighting them before passing them to the DLM (see the combination sketch after this list).
- Evaluation – Word error rate (WER) is measured on standard test sets while varying total training compute (GPU‑hours) and the amount of TTS‑generated synthetic data used for pre‑training (a minimal WER implementation is included after this list).
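To make the training objective concrete, here is a minimal, self-contained sketch of one denoising training step in PyTorch. The model layout, dimensions, vocabulary size, and data are placeholders chosen for illustration, not the configuration from the released pipeline; the point is only that the DLM reads a noisy ASR hypothesis and is trained with cross-entropy against the clean reference tokens.

```python
import torch
import torch.nn as nn

# Illustrative denoising LM: encode the noisy ASR hypothesis, decode the
# clean reference with teacher forcing, and train with token-level
# cross-entropy.  All sizes and layer counts are toy values.
VOCAB, PAD = 8000, 0

class DenoisingLM(nn.Module):
    def __init__(self, vocab=VOCAB, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, noisy, clean_in):
        # Causal mask so the decoder only attends to past clean tokens.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(clean_in.size(1))
        hidden = self.transformer(self.embed(noisy), self.embed(clean_in),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)

model = DenoisingLM()
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy batch of subword ids: a noisy hypothesis and its clean reference.
noisy = torch.randint(1, VOCAB, (8, 20))   # ASR output containing errors
clean = torch.randint(1, VOCAB, (8, 22))   # reference transcript

logits = model(noisy, clean[:, :-1])       # shifted input for teacher forcing
loss = loss_fn(logits.reshape(-1, VOCAB), clean[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

In the paper's setup the noisy inputs come from running the ASR system on TTS-rendered text, perturbed via SpecAugment, embedding dropout, and mixup as described above, rather than from the random ids used in this toy batch.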
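The summary above leaves the exact aggregation in DLM-sum open, so the following sketch shows only one plausible realization: every N-best entry is used both as a DLM input and as a candidate correction, and each candidate is scored by an ASR-weighted sum of DLM posteriors over all inputs instead of conditioning on the 1-best alone as in DSR. The function names, the softmax weighting, and the choice of candidate set are assumptions made for illustration, not the paper's exact recipe.

```python
import math
from typing import Callable, List, Sequence, Tuple

def dlm_sum_decode(
    nbest: List[Tuple[Sequence[int], float]],
    dlm_logprob: Callable[[Sequence[int], Sequence[int]], float],
) -> Sequence[int]:
    """Hypothesis combination in the spirit of DLM-sum (illustrative only).

    nbest:        (token_ids, asr_log_score) pairs from the ASR N-best list.
    dlm_logprob:  hypothetical callable returning log p(candidate | noisy_input)
                  under the trained DLM.
    """
    assert nbest, "empty N-best list"

    # Softmax the ASR log-scores into weights over the N-best inputs.
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]
    total = sum(weights)
    weights = [w / total for w in weights]

    best, best_score = None, -math.inf
    for candidate, _ in nbest:
        # p(candidate) ~ sum_i  w_i * p(candidate | input_i)
        score = sum(
            w * math.exp(dlm_logprob(noisy_input, candidate))
            for (noisy_input, _), w in zip(nbest, weights)
        )
        if score > best_score:
            best, best_score = candidate, score
    return best
```

DSR, by contrast, conditions the DLM only on nbest[0], the 1-best hypothesis, and ignores the remaining entries.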
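For completeness, the WER figures in the results below follow the standard definition: word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Established scorers exist, but the definition fits in a few lines:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over five reference words -> WER = 0.2 (i.e. 20 %)
print(word_error_rate("turn on the kitchen lights", "turn on a kitchen lights"))
```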
Results & Findings
| Setting | LM WER | DLM (DSR) WER | DLM‑sum WER |
|---|---|---|---|
| Low compute (≈ 50 GPU‑h) | 9.8 % | 10.2 % | 10.0 % |
| Mid compute (≈ 200 GPU‑h) | 9.2 % | 8.9 % | 8.5 % |
| High compute (≈ 800 GPU‑h) | 8.7 % | 8.1 % | 7.7 % |
- Compute tipping point: DLMs start to pull ahead after ~150 GPU‑hours of training.
- Scaling behavior: DLM gains increase with longer training, while LM performance plateaus earlier.
- Vocabulary effect: With subword units the absolute WER reduction is roughly 0.5 percentage points, compared with the roughly 1.5 points reported for character‑based models.
- DLM‑sum advantage: Leveraging multiple hypotheses yields a consistent 0.3–0.5 percentage‑point absolute WER improvement over DSR.
Practical Implications
- Deployable improvement: For production ASR pipelines that can afford longer model training (e.g., cloud‑based services), swapping a traditional LM for a DLM cuts WER by about one point absolute in the reported high‑compute setting (roughly a 10 % relative reduction), which translates directly into a better user experience in voice assistants, transcription services, and call‑center analytics.
- Better use of ASR uncertainty: DLM‑sum demonstrates that feeding richer hypothesis information (N‑best lists or lattices) into the language model is more effective than the classic 1‑best approach, encouraging developers to expose this richer data downstream.
- Scalable training recipes: The released pipeline includes scripts for data augmentation and synthetic TTS pre‑training, making it easier for teams to experiment without reinventing the wheel.
- Hardware budgeting: The identified compute tipping point helps product managers decide whether the extra GPU budget is justified for a given accuracy target.
- Compatibility with existing stacks: Because the DLM operates on the same subword token stream as conventional LMs, it can be dropped into existing decoding graphs (e.g., Kaldi, ESPnet, or Hugging Face pipelines) with minimal engineering effort.
Limitations & Future Work
- Vocabulary dependence – Gains shrink when moving from character to subword vocabularies, suggesting that further research is needed to close the gap.
- Compute‑intensive – The advantage only appears after substantial training time, which may be prohibitive for smaller teams or on‑device scenarios.
- Synthetic data quality – The study relies on TTS‑generated data; real‑world noisy transcripts could behave differently.
- Future directions proposed by the authors include exploring more efficient denoising objectives (e.g., contrastive losses), integrating lattice‑level features directly into the DLM, and extending the analysis to multilingual or code‑switching settings.
Authors
- Dorian Koch
- Albert Zeyer
- Nick Rossenbach
- Ralf Schlüter
- Hermann Ney
Paper Information
- arXiv ID: 2512.13576v1
- Categories: cs.NE
- Published: December 15, 2025