[Paper] Peeking Into The Future For Contextual Biasing
Source: arXiv - 2512.17657v1
Overview
The paper introduces a lightweight “future‑peeking” technique that lets modern end‑to‑end (E2E) speech‑to‑text models better recognize rare or unseen named entities (e.g., contact names, street addresses). By predicting several tokens ahead instead of just the next one, the model can directly score candidate entities from a supplied list, dramatically cutting the error rate on those words without adding bulky extra modules.
Key Contributions
- Future‑Peeking Decoding: Extends the decoder to emit multiple upcoming tokens simultaneously, allowing the model to evaluate whole‑entity hypotheses on the fly.
- Zero‑Extra‑Encoder Design: Re‑uses the existing AED logits for biasing, eliminating the need for a separate entity encoder or cross‑attention block.
- Large Relative Gains on Named Entities: Shows up to a 50 % relative reduction in named‑entity word error rate (NE‑WER) on LibriSpeech compared with a vanilla AED baseline.
- Simple Integration: The method can be dropped into any attention‑based encoder‑decoder ASR pipeline with minimal code changes and no extra training data.
- Comprehensive Ablation: Analyses how the number of peeked tokens, list size, and confidence thresholds affect performance, providing practical knobs for developers.
Methodology
- Baseline Model: An attention‑based encoder‑decoder (AED) ASR system that predicts the next token given the acoustic encoder output and the decoder state.
- Candidate Entity List: At inference time, a list of possible named entities (e.g., contacts, locations) is supplied to the system.
- Multi‑Token Prediction Head: The decoder is modified to output logits for the next K tokens (e.g., K = 3) in a single forward pass; each of the K positions yields a probability distribution over the subword vocabulary.
- Scoring Candidates: For each candidate entity, the model computes a score by summing the log‑probabilities of its constituent tokens across the K‑step predictions (equivalent to multiplying the token probabilities). The highest‑scoring candidate is then injected into the beam search as a bias.
- Decision Logic: If a candidate’s score exceeds a configurable threshold, the decoder forces that entity into the output; otherwise it proceeds with the usual token‑by‑token decoding.
- Training: No extra loss is added; the model is trained exactly as a standard AED system. The future‑peeking head is only activated during inference, keeping training pipelines unchanged.
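The scoring and decision steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names (`future_logprobs`, `score_candidate`, `pick_bias`) are hypothetical, and it assumes the decoder exposes, at the current step, K per-position log-probability tables indexed by token id.

```python
import math

def score_candidate(entity_tokens, future_logprobs):
    """Sum the log-probs of the entity's leading tokens under the K-step
    predictions (equivalent to multiplying their probabilities).
    A length-normalized variant would divide by k."""
    k = min(len(entity_tokens), len(future_logprobs))
    return sum(future_logprobs[i][entity_tokens[i]] for i in range(k))

def pick_bias(candidates, future_logprobs, threshold):
    """Return the best-scoring entity if it clears the confidence
    threshold, else None (fall back to normal token-by-token decoding)."""
    best = max(candidates, key=lambda c: score_candidate(c, future_logprobs))
    if score_candidate(best, future_logprobs) >= threshold:
        return best
    return None

# Toy example with K = 2 and a two-token vocabulary.
future_logprobs = [
    {0: math.log(0.9), 1: math.log(0.05)},  # step t+1
    {0: math.log(0.1), 1: math.log(0.8)},   # step t+2
]
candidates = [[0, 1], [1, 0]]  # token-id sequences for two list entries
chosen = pick_bias(candidates, future_logprobs, threshold=math.log(0.3))
```

In the full system, a chosen entity would be forced into the beam hypothesis; when `pick_bias` returns `None`, decoding proceeds normally.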
Results & Findings
| Metric | Baseline AED | Future‑Peeking AED | Relative Δ |
|---|---|---|---|
| Overall WER (LibriSpeech test‑clean) | 4.2 % | 4.1 % | −2.4 % |
| Named‑Entity WER (NE‑WER) | 12.8 % | 6.4 % | −50 % |
| Inference latency (per utterance) | 120 ms | 130 ms | +8.3 % |
- NE‑WER drops by more than half, confirming that the model can reliably surface rare entities when they appear in the supplied list.
- Overall transcription quality remains essentially unchanged, indicating that the biasing does not hurt generic speech recognition.
- Latency impact is modest (≈8 % slower) because the extra computation is limited to a small K‑step softmax and simple scoring, far cheaper than adding a full cross‑attention encoder.
Ablation studies reveal that:
- Increasing K beyond 4 yields diminishing returns while adding latency.
- Larger candidate lists (up to ~200 items) still maintain gains, though precision drops slightly; a confidence threshold mitigates false insertions.
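The three knobs the ablation identifies map naturally onto a small runtime configuration. The sketch below is illustrative only; the class and field names are hypothetical, and the default values merely echo the ranges reported above.

```python
from dataclasses import dataclass

@dataclass
class BiasingConfig:
    k: int = 3                 # look-ahead length; the paper reports diminishing returns past K = 4
    max_list_size: int = 200   # larger candidate lists still help, with slightly lower precision
    log_threshold: float = -1.0  # bias only when an entity's log-score clears this bar

def should_bias(entity_log_score: float, cfg: BiasingConfig) -> bool:
    # Gate biasing on the confidence threshold to limit false insertions.
    return entity_log_score >= cfg.log_threshold

cfg = BiasingConfig()
```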
Practical Implications
- Voice Assistants & IVR: Developers can plug in a user‑specific contact or command list at runtime, dramatically improving recognition of personal names, product codes, or location names without retraining the acoustic model.
- Enterprise Transcription: Call‑center analytics can bias toward company‑specific jargon or client names, reducing manual correction effort.
- Edge Deployment: Because the method avoids extra neural modules, it fits well on on‑device ASR chips where memory and compute budgets are tight.
- Rapid Prototyping: Teams can experiment with new entity vocabularies (e.g., new product launches) by simply updating the candidate list, bypassing costly data collection and model fine‑tuning cycles.
Limitations & Future Work
- List Dependency: The approach only helps for entities present in the supplied list; truly unseen names remain a challenge.
- Scoring Simplicity: Multiplying token probabilities assumes independence across future steps, which may be sub‑optimal for longer multi‑word entities.
- Threshold Sensitivity: Choosing the biasing confidence threshold requires validation; an overly aggressive threshold can cause hallucinated entities.
- Future Directions: The authors suggest exploring learned dynamic K (adaptive look‑ahead length), integrating a lightweight language model for better multi‑token coherence, and extending the technique to streaming ASR scenarios where future context is limited.
Authors
- Ramaneswaran Selvakumar
- Cindy Tseng
- Eesung Kim
- Vijendra Raj Apsingekar
- Yun Tang
Paper Information
- arXiv ID: 2512.17657v1
- Categories: cs.CL
- Published: December 19, 2025