[Paper] Diffusion Language Models for Speech Recognition
Source: arXiv - 2604.14001v1
Overview
The paper investigates how diffusion language models (DLMs), a class of generative models that condition on bidirectional context, can improve automatic speech recognition (ASR). The authors adapt two DLM variants, masked diffusion language models (MDLM) and uniform‑state diffusion models (USDM), to rescore ASR hypotheses and to decode jointly with a CTC acoustic model, and they report word‑error‑rate (WER) reductions in both settings.
Key Contributions
- Comprehensive recipe for integrating MDLM and USDM into ASR pipelines, covering data preparation, training, and inference.
- Joint‑decoding algorithm that fuses frame‑wise CTC probabilities with label‑wise USDM probabilities at each step, creating hybrid candidates that benefit from both acoustic and language knowledge.
- Extensive empirical evaluation on standard speech corpora demonstrating that both MDLM and USDM outperform conventional n‑gram and Transformer‑based rescoring baselines.
- Open‑source release of all code, model checkpoints, and training scripts, enabling reproducibility and rapid adoption by the community.
Methodology
Diffusion Language Modeling
- Diffusion models generate text by gradually “denoising” a noisy token sequence.
- MDLM follows the masked diffusion paradigm: a subset of tokens is masked, and the model learns to reconstruct them, similar to BERT but with a diffusion process.
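- USDM adopts uniform‑state diffusion: instead of masking, the noising process replaces tokens with ones drawn uniformly from the vocabulary, so every position is treated the same way at every diffusion step, which simplifies the training dynamics.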
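The difference between the two noising schemes can be illustrated with a minimal sketch. It assumes a single noise level `t` in [0, 1] applied independently to each token; the function names, the `<mask>` symbol, and the toy vocabulary are hypothetical and are not taken from the paper's code.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not from the paper


def corrupt_masked(tokens, t, rng):
    """Masked (absorbing-state) noise: replace each token by MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def corrupt_uniform(tokens, t, vocab, rng):
    """Uniform-state noise: with probability t, resample the token uniformly from vocab.

    The resampled token may happen to equal the original one.
    """
    return [rng.choice(vocab) if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    for t in (0.0, 0.5, 1.0):
        print(t, corrupt_masked(sentence, t, rng), corrupt_uniform(sentence, t, vocab, rng))
```

At `t = 0` both functions return the clean sentence. At `t = 1` the masked variant returns only `<mask>` symbols, while the uniform variant returns a sequence of random vocabulary tokens. In both cases the denoising model would be trained to recover the clean sequence from such corrupted inputs.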
Rescoring ASR Hypotheses
- An ASR system first produces an N‑best list (or lattice) using a conventional acoustic model (e.g., CTC or hybrid HMM‑DNN).
- Each candidate is scored by the diffusion LM: the model computes the log‑probability of the entire token sequence, which is combined with the acoustic score (typically via a log‑linear interpolation).
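The log‑linear interpolation used for rescoring can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Hypothesis` class, the `lm_weight` and `length_bonus` parameters, and the example scores are all hypothetical, and `lm_logp` stands for whatever sequence‑level score the diffusion LM provides.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    am_logp: float  # acoustic (e.g. CTC) log-score of the hypothesis
    lm_logp: float  # sequence-level log-score from the language model


def rescore(nbest, lm_weight=0.5, length_bonus=0.0):
    """Sort an N-best list by am_logp + lm_weight * lm_logp + length_bonus * num_words."""

    def combined(h):
        return h.am_logp + lm_weight * h.lm_logp + length_bonus * len(h.text.split())

    return sorted(nbest, key=combined, reverse=True)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    nbest = [
        Hypothesis("i scream", am_logp=-4.0, lm_logp=-9.0),
        Hypothesis("ice cream", am_logp=-4.2, lm_logp=-6.0),
    ]
    for h in rescore(nbest, lm_weight=0.5):
        print(h.text)
```

In this toy example the acoustic model slightly prefers "i scream", but the LM score moves "ice cream" to the top once `lm_weight` is large enough. In practice the weight would be tuned on a development set.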
Joint Decoding with CTC + USDM
- Instead of a two‑stage pipeline (decode → rescore), the authors propose a tight integration: at every decoding step, the CTC model’s frame‑wise distribution $p_{\text{CTC}}(t|x)$ and the USDM’s label‑wise distribution $p_{\text{USDM}}(y|t)$ are multiplied, which is equivalent to summing their log‑probabilities.
- This yields a combined probability that guides beam search, allowing the decoder to generate new hypotheses that were not present in the original N‑best list.
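The fusion step can be sketched roughly as follows. This is not the paper's algorithm. It assumes the frame‑wise CTC output has already been mapped to one log‑distribution per label position (for example via an alignment), which is the hard part and is left out here. All function names, the `lm_weight` parameter, and the `top_k` pruning are hypothetical.

```python
import math


def log_softmax(scores):
    """Renormalize a list of log-scores so that their exponentials sum to 1."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def fuse_position(ctc_logp, lm_logp, lm_weight=1.0):
    """Log-linear fusion of two log-distributions over the same vocabulary."""
    fused = [a + lm_weight * b for a, b in zip(ctc_logp, lm_logp)]
    return log_softmax(fused)


def fuse_sequence(ctc_label_logp, lm_label_logp, lm_weight=1.0, top_k=2):
    """For each label position, return the top_k (token_id, fused_logp) candidates."""
    candidates = []
    for ctc_row, lm_row in zip(ctc_label_logp, lm_label_logp):
        fused = fuse_position(ctc_row, lm_row, lm_weight)
        ranked = sorted(enumerate(fused), key=lambda pair: pair[1], reverse=True)
        candidates.append(ranked[:top_k])
    return candidates


if __name__ == "__main__":
    # One label position, vocabulary of three tokens, made-up probabilities.
    ctc = [[math.log(p) for p in (0.5, 0.4, 0.1)]]
    lm = [[math.log(p) for p in (0.1, 0.8, 0.1)]]
    print(fuse_sequence(ctc, lm, lm_weight=1.0, top_k=2))
```

In the toy example, token 0 is the acoustic favorite, but token 1 wins after fusion because the language model strongly prefers it. A beam search could then extend hypotheses from these per‑position candidates.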
Training Details
- Both MDLM and USDM are trained on large text corpora (e.g., LibriSpeech LM data) using the standard diffusion loss.
- The acoustic CTC model is trained separately on paired audio‑text data. No joint training is required, which keeps the approach modular.
Results & Findings
| Model / Setup | WER (dev) | WER (test) |
|---|---|---|
| Baseline CTC (no LM) | 7.8 % | 8.2 % |
| CTC + 4‑gram LM | 6.9 % | 7.3 % |
| CTC + Transformer LM (shallow) | 6.4 % | 6.8 % |
| CTC + MDLM rescoring | 6.1 % | 6.5 % |
| CTC + USDM rescoring | 5.9 % | 6.2 % |
| CTC + USDM joint decoding | 5.5 % | 5.8 % |
- In rescoring, both diffusion LMs outperform the baselines: by 0.3–0.6 % absolute WER over the Transformer LM and by 0.8–1.1 % over the 4‑gram LM.
- The joint decoding strategy yields the largest gain, confirming that merging acoustic and diffusion‑based language scores at inference time can create better hypotheses than rescoring alone.
- Ablation studies show that the uniform‑state diffusion schedule is more stable during training and requires fewer diffusion steps than the masked variant, while still delivering comparable accuracy.
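For readers who want to reproduce this kind of comparison, WER is the word‑level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain implementation of that standard definition, plus a helper that converts an absolute reduction into a relative one. The function names are hypothetical and it is not the scoring tool used in the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Relative WER reduction with respect to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the hat sat on mat"))
    # Test-set numbers from the table: baseline CTC vs. CTC + USDM joint decoding.
    print(relative_reduction(8.2, 5.8))
```

The second call uses the test‑set figures from the table: the 2.4‑point absolute drop from 8.2 % to 5.8 % corresponds to a relative reduction of roughly 29 %.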
Practical Implications
- Plug‑and‑play improvement: Developers can add MDLM/USDM rescoring to existing CTC‑based ASR services without retraining the acoustic model, gaining immediate accuracy boosts.
- Real‑time feasibility: USDM’s simpler diffusion schedule translates to faster inference (≈2× speedup over MDLM), making it suitable for low‑latency applications such as voice assistants or transcription services.
- Enhanced robustness: Because diffusion LMs incorporate bidirectional context naturally, they handle noisy or ambiguous utterances better than left‑to‑right autoregressive LMs, reducing error spikes in conversational AI.
- Open‑source toolkit: The released recipes integrate with popular frameworks (ESPnet, Kaldi, PyTorch), lowering the barrier for research labs and startups to experiment with diffusion‑based language modeling.
Limitations & Future Work
- Computational overhead: Even the faster USDM adds noticeable latency compared to lightweight n‑gram rescoring, which may be prohibitive for ultra‑low‑delay devices.
- Memory footprint: Diffusion models require larger GPU memory during inference, especially for long utterances, limiting deployment on edge hardware.
- Domain adaptation: The paper focuses on read speech (LibriSpeech); adapting diffusion LMs to highly domain‑specific vocabularies (e.g., medical dictation) remains an open challenge.
- Joint training: While the current approach keeps acoustic and language models separate, future work could explore end‑to‑end training of CTC + diffusion LM to further tighten the acoustic‑language synergy.
The authors have made their code and pretrained models publicly available, so you can start experimenting with diffusion language models in your own ASR pipelines today.
Authors
- Davyd Naveriani
- Albert Zeyer
- Ralf Schlüter
- Hermann Ney
Paper Information
- arXiv ID: 2604.14001v1
- Categories: cs.CL, cs.AI, cs.LG, cs.NE
- Published: April 15, 2026