[Paper] Text-Utilization for Encoder-dominated Speech Recognition Models

Published: April 29, 2026 at 06:28 AM EDT

Source: arXiv - 2604.26514v1

Overview

The paper “Text-Utilization for Encoder‑dominated Speech Recognition Models” tackles a practical problem many speech‑tech teams face: how to make the most of abundant text‑only data when building fast, encoder‑heavy ASR systems. By re‑thinking the balance between encoder size and decoder complexity, the authors show that you can boost accuracy while keeping inference speed high—an attractive win for real‑time applications.

Key Contributions

  • Systematic comparison of text‑only integration techniques (modality matching, dynamic down‑sampling, random duration modeling).
  • Evidence that a larger encoder + smaller decoder can match or beat traditional encoder‑decoder ratios on LibriSpeech, reducing latency without sacrificing WER.
  • Demonstration that simple “random duration” models outperform more elaborate schemes, simplifying the training pipeline.
  • Open‑source release of code and reproducible recipes, enabling immediate experimentation.

Methodology

Model Architecture

The authors focus on encoder‑dominated end‑to‑end ASR models (e.g., Conformer or Transformer encoders) where the decoder is deliberately lightweight.
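
To make the "encoder‑dominated" idea concrete, here is a minimal PyTorch sketch of how capacity can be shifted toward the encoder. The layer counts, widths, and dimensions below are illustrative assumptions, not the paper's configuration.

```python
# Sketch: an encoder-heavy / decoder-light split in plain PyTorch.
# All hyperparameters below are illustrative assumptions, not the paper's settings.
import torch.nn as nn

d_model = 512

# Large encoder: most parameters live here and run fully in parallel over the utterance.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=2048, batch_first=True),
    num_layers=16,
)

# Small decoder: only a couple of autoregressive layers, so per-token decoding stays cheap.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=2048, batch_first=True),
    num_layers=2,
)

def count_params(module: nn.Module) -> float:
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"encoder: {count_params(encoder):.1f}M params, decoder: {count_params(decoder):.1f}M params")
```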

Text‑Only Data Integration

Three main strategies are evaluated:

  • Modality Matching – Aligning the distribution of text embeddings with acoustic embeddings via an auxiliary loss.
  • Dynamic Down‑sampling – Learning to compress encoder outputs to a “text‑level” sequence length, making it easier to fuse with pure‑text inputs.
  • Random Duration Modeling – Randomly assigning durations to text tokens during training, effectively teaching the encoder to handle variable‑length inputs without a dedicated duration predictor (a minimal sketch follows this list).
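
Of the three, random duration modeling is the easiest to reproduce. The sketch below shows the core idea: embedded text tokens are simply repeated a random number of times so that the resulting sequence resembles the frame rate the encoder normally sees for audio. The duration range used here is an assumption for illustration, not a value from the paper.

```python
# Sketch: random duration modeling for text-only batches.
# Each text-token embedding is repeated a random number of times so the
# sequence length mimics acoustic frame rates. The range (2..8) is an assumption.
import torch

def expand_with_random_durations(token_emb: torch.Tensor,
                                 min_dur: int = 2, max_dur: int = 8) -> torch.Tensor:
    """(num_tokens, dim) -> (num_frames, dim) with a random duration per token."""
    durations = torch.randint(min_dur, max_dur + 1, (token_emb.size(0),))
    return token_emb.repeat_interleave(durations, dim=0)

tokens = torch.randn(12, 512)                      # 12 embedded text tokens, dim 512
pseudo_acoustic = expand_with_random_durations(tokens)
print(tokens.shape, "->", pseudo_acoustic.shape)   # e.g. (12, 512) -> (~60, 512)
```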

Training Regime

A two‑stage process:

  1. Pre‑train the encoder on paired audio‑text data.
  2. Fine‑tune with mixed batches of paired and text‑only examples using the chosen integration technique (sketched below).
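
The paper describes this second stage only at a high level; the sketch below shows one plausible way to alternate paired and text‑only batches. The `asr_loss` and `text_only_loss` helpers are hypothetical placeholders for the supervised ASR loss and whichever text‑integration loss is chosen, not names from the paper or any library.

```python
# Sketch: stage-2 fine-tuning with mixed paired and text-only batches.
# `model`, `paired_loader`, `text_loader`, `asr_loss`, `text_only_loss` are
# hypothetical placeholders, not identifiers from the paper.
import itertools
import torch

def finetune(model, paired_loader, text_loader, asr_loss, text_only_loss,
             steps: int = 10_000, lr: float = 1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    paired_iter, text_iter = itertools.cycle(paired_loader), itertools.cycle(text_loader)
    for step in range(steps):
        opt.zero_grad()
        if step % 2 == 0:                 # paired audio-text batch
            audio, text = next(paired_iter)
            loss = asr_loss(model, audio, text)
        else:                             # text-only batch via the chosen integration technique
            text = next(text_iter)
            loss = text_only_loss(model, text)
        loss.backward()
        opt.step()
```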

Evaluation

Experiments are run on the LibriSpeech 960‑hour corpus, reporting word error rate (WER) on both clean and other test sets, as well as inference speed (real‑time factor).
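
Both metrics are simple to reproduce. The sketch below computes WER with the `jiwer` package (a tooling assumption; the paper does not name its scorer) and the real‑time factor as processing time divided by audio duration.

```python
# Sketch: the two reported metrics.
# WER via jiwer (tooling assumption); RTF = processing time / audio duration.
import time
import jiwer

def word_error_rate(references: list[str], hypotheses: list[str]) -> float:
    return jiwer.wer(references, hypotheses)

def real_time_factor(transcribe, audio, audio_seconds: float) -> float:
    start = time.perf_counter()
    transcribe(audio)                     # run the ASR system end to end
    return (time.perf_counter() - start) / audio_seconds
```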

Results & Findings

| Model Variant | Encoder Size | Decoder Size | WER (clean) | WER (other) | Real‑Time Factor |
|---|---|---|---|---|---|
| Baseline (balanced) | Medium | Medium | 3.1 % | 7.8 % | 0.45 |
| Larger Encoder / Small Decoder (random duration) | Large | Small | 2.8 % | 7.2 % | 0.38 |
| Larger Encoder / Small Decoder (modality matching) | Large | Small | 3.0 % | 7.5 % | 0.40 |
| Larger Encoder / Small Decoder (dynamic down‑sampling) | Large | Small | 2.9 % | 7.4 % | 0.39 |

  • Random duration modeling consistently yields the lowest WER, beating the more complex dynamic down‑sampling approach.
  • The large‑encoder/small‑decoder configuration matches or surpasses the baseline while cutting inference time by ~15 %.
  • Adding text‑only data via any of the three methods improves performance over a purely supervised baseline, confirming the value of leveraging abundant text corpora.

Practical Implications

  • Faster Real‑Time ASR – By shifting capacity to the encoder, models can be deployed on edge devices (phones, embedded boards) with limited compute for autoregressive decoding.
  • Simplified Training Pipelines – Random duration modeling requires no extra duration predictor or alignment model, meaning fewer moving parts and easier debugging.
  • Better Utilization of Text Corpora – Companies with massive text logs (e.g., chat transcripts, subtitles) can now inject that data into their ASR models without costly audio labeling.
  • Scalable Architecture Design – The findings encourage a design pattern: “big encoder, lean decoder,” which aligns well with modern hardware accelerators that excel at parallel sequence processing.

Limitations & Future Work

  • Dataset Scope – Experiments are limited to LibriSpeech; performance on noisy, far‑field, or multilingual data remains untested.
  • Decoder Expressiveness – While a small decoder speeds up inference, it may struggle with highly contextual language modeling (e.g., code‑switching, domain‑specific jargon).
  • Text‑Only Integration Overhead – The paper does not quantify the extra memory cost of storing large text‑only batches during fine‑tuning.
  • Future Directions – Extending the approach to streaming ASR, exploring multilingual text‑only pre‑training, and investigating hybrid encoder‑decoder scaling laws across different model families.

Authors

  • Albert Zeyer
  • Tim Posielek
  • Ralf Schlüter
  • Hermann Ney

Paper Information

  • arXiv ID: 2604.26514v1
  • Categories: cs.CL, cs.AI, cs.NE
  • Published: April 29, 2026
  • PDF: Download PDF