[Paper] Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
Source: arXiv - 2603.06505v1
Overview
The paper presents Speak in Context, a multilingual automatic‑speech‑recognition (ASR) system that covers 11 languages and several English dialects while leveraging contextual cues such as dialogue history or domain‑specific keywords. By marrying a frozen pretrained speech encoder with a decoder‑only language model through a small projection layer and a contrastive‑learning alignment loss, the authors achieve a modular, plug‑and‑play architecture that consistently improves transcription quality, reducing word error rate by roughly 5 percentage points on a large, real‑world conversational dataset.
Key Contributions
- Multilingual, context‑aware ASR framework that works across 11 languages and 5 English dialects without retraining the backbone models.
- Modular design: a frozen speech encoder + decoder‑only LM + lightweight projection, preserving the benefits of large pretrained models while keeping compute low.
- Structured context prompts (dialogue turns, biasing words, etc.) that can be injected at inference time to steer the transcription.
- Contrastive learning objective that aligns speech embeddings with contextual embeddings in a shared space, providing a principled cross‑modal interaction.
- Extensive real‑world evaluation on >1,500 h of conversational speech, demonstrating consistent gains across languages and context types.
Methodology
Backbone components
- Speech encoder: a pretrained, frozen model (e.g., wav2vec‑2.0) that converts raw audio into a sequence of high‑dimensional embeddings.
- Language model: a decoder‑only transformer (e.g., GPT‑Neo) that generates token streams from textual prompts.
Projection module
- A small linear layer (plus optional layer‑norm) maps the speech encoder’s output into the LM’s embedding space, enabling the two modules to talk to each other without fine‑tuning either backbone.
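A minimal sketch of this projection step. The dimensions are illustrative (wav2vec 2.0 base emits 768‑d frames; the 2048‑d LM embedding space is an assumption), and the weights are random stand‑ins for the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 768-d speech-encoder frames, assumed 2048-d LM space.
SPEECH_DIM, LM_DIM = 768, 2048

# The projection is just a learned linear map plus optional layer norm;
# random weights here stand in for the trained parameters.
W = rng.standard_normal((SPEECH_DIM, LM_DIM)) * 0.02
b = np.zeros(LM_DIM)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def project(speech_embeddings):
    # speech_embeddings: (frames, SPEECH_DIM) -> (frames, LM_DIM)
    return layer_norm(speech_embeddings @ W + b)

frames = rng.standard_normal((50, SPEECH_DIM))  # one utterance, 50 frames
lm_ready = project(frames)
print(lm_ready.shape)  # (50, 2048)
```

Because only `W` and `b` (and the norm parameters) are learned, the trainable footprint is a few million parameters regardless of backbone size.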
Context representation
- Context is tokenized just like any other text (dialogue history, biasing words, task instructions) and fed to the LM as a prompt before the audio‑driven tokens.
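One plausible way to serialize such context into a single prompt string before tokenization; the tag names (`<task>`, `<history>`, `<bias>`) are illustrative, not the paper's exact template:

```python
def build_context_prompt(dialogue_history, biasing_words, instruction=None):
    """Serialize structured context into one text prompt that is tokenized
    like ordinary text and placed before the audio-driven tokens."""
    parts = []
    if instruction:
        parts.append(f"<task> {instruction}")
    for turn in dialogue_history:
        parts.append(f"<history> {turn}")
    if biasing_words:
        parts.append("<bias> " + ", ".join(biasing_words))
    return "\n".join(parts)

prompt = build_context_prompt(
    dialogue_history=["Agent: How can I help?", "Caller: My router is down."],
    biasing_words=["Archer AX55", "firmware"],
    instruction="Transcribe the caller's next utterance.",
)
print(prompt)
```

Because the context is plain text, new biasing lists or dialogue turns can be swapped in at inference time without touching any model weights.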
Contrastive alignment loss
- For each training step, the model samples a positive pair (speech embedding ↔ its true context) and several negative pairs (speech ↔ mismatched contexts).
- A contrastive loss (InfoNCE) pushes the positive pair’s cosine similarity higher while pulling negatives apart, shaping a joint embedding space where speech and its relevant context are close.
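The InfoNCE objective for one speech embedding against a batch of candidate contexts can be sketched as follows; the temperature value is a common default, not taken from the paper:

```python
import numpy as np

def info_nce(speech_emb, context_embs, positive_idx, temperature=0.07):
    """InfoNCE loss for one speech embedding vs. a batch of contexts.

    context_embs[positive_idx] is the matching context; all other rows act
    as negatives. Embeddings are L2-normalized so dot products are cosine
    similarities.
    """
    s = speech_emb / np.linalg.norm(speech_emb)
    C = context_embs / np.linalg.norm(context_embs, axis=1, keepdims=True)
    logits = C @ s / temperature        # scaled cosine similarities
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[positive_idx])

rng = np.random.default_rng(0)
ctx = rng.standard_normal((8, 256))               # 1 positive + 7 negatives
speech = ctx[3] + 0.1 * rng.standard_normal(256)  # speech close to context 3
loss_matched = info_nce(speech, ctx, positive_idx=3)
loss_mismatched = info_nce(speech, ctx, positive_idx=5)
print(loss_matched < loss_mismatched)  # True: the matching pair scores lower loss
```

Minimizing this loss raises the similarity of matched speech–context pairs relative to mismatched ones, which is exactly the shared-embedding-space behavior described above.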
Training regime
- The speech encoder and LM remain frozen; only the projection layer and the contrastive loss head are updated.
- Standard CTC or cross‑entropy ASR loss is still applied to the transcription output, so the system learns both accurate speech‑to‑text mapping and cross‑modal alignment.
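The combined objective can be sketched as a weighted sum of the ASR loss and the alignment loss; the mixing weight `lam` is an assumed hyperparameter, not reported in the paper:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    """Per-token cross-entropy over transcription logits (log-sum-exp form)."""
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]

def combined_loss(asr_logits, target_idx, contrastive_loss, lam=0.5):
    """ASR loss plus weighted contrastive alignment term."""
    return cross_entropy(asr_logits, target_idx) + lam * contrastive_loss

logits = np.array([2.0, 0.5, -1.0])  # toy logits over a 3-token vocabulary
loss = combined_loss(logits, target_idx=0, contrastive_loss=0.8)
print(round(float(loss), 3))  # 0.641
```

Only gradients flowing into the projection layer and the contrastive head are applied; the encoder's and LM's parameters are left untouched.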
Results & Findings
| Metric | Baseline (no context) | + Structured Context | + Contrastive Alignment |
|---|---|---|---|
| Average absolute WER reduction (11 languages) | — | ≈ 3 % | ≈ 5 % |
| Best‑case language WER (e.g., Mandarin) | 12.8 % | 10.2 % | 9.5 % |
| English dialects WER (avg. over 5) | 7.4 % | 5.9 % | 5.2 % |
- Context matters: Adding dialogue history or biasing words consistently lowered word‑error‑rate (WER) across all languages.
- Contrastive alignment adds value: The extra loss gave an additional ~2 % absolute WER gain on top of raw context prompts, confirming that a shared embedding space improves the model’s ability to “listen” for relevant cues.
- Modularity works: Freezing the large pretrained encoders saved GPU memory and training time (≈ 30 % less compute) while still achieving state‑of‑the‑art multilingual performance.
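For reference, WER (the metric throughout the table above) counts word-level substitutions, insertions, and deletions against the reference transcript, normalized by the reference length:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein edit distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

# One substitution ("the" -> "a") out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1666...
```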
Practical Implications
- Plug‑and‑play multilingual ASR: Companies can adopt the projection‑only fine‑tuning step to extend existing speech‑to‑text services to new languages or dialects without costly retraining of massive models.
- Dynamic biasing for domain‑specific vocabularies: By feeding biasing words (e.g., product names, medical terminology) as prompts, developers can improve recognition of rare or out‑of‑vocabulary terms on the fly.
- Conversational agents & call‑center analytics: The ability to ingest prior dialogue turns as context enables more accurate transcription in multi‑turn interactions, reducing downstream NLP errors.
- Resource‑efficient deployment: Since the heavy encoders stay frozen, inference can be split across devices (e.g., edge‑device speech encoder, cloud‑hosted LM), opening up low‑latency, privacy‑preserving deployments.
Limitations & Future Work
- Context length: The current prompt handling is limited by the LM’s maximum context window (≈ 2 k tokens), which may truncate very long conversations.
- Contrastive loss scaling: The alignment benefit plateaus after a certain number of negative samples; more sophisticated sampling or memory‑bank techniques could yield further gains.
- Language coverage: While 11 languages and several English dialects were tested, low‑resource languages with scarce pre‑trained encoders remain an open challenge.
- Real‑time latency: Adding the projection and contrastive alignment introduces a modest overhead; future work could explore quantization or distillation to meet strict real‑time requirements.
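The context‑length limitation above is commonly mitigated by keeping only the most recent dialogue turns that fit the token budget; a minimal sketch, using whitespace word counting as a stand‑in for the real tokenizer:

```python
def truncate_history(turns, max_tokens,
                     count_tokens=lambda t: len(t.split())):
    """Keep the most recent dialogue turns that fit the LM's context budget.

    Walks the history newest-first, accumulating turns until the budget is
    exhausted, then restores chronological order.
    """
    kept, used = [], 0
    for turn in reversed(turns):
        n = count_tokens(turn)
        if used + n > max_tokens:
            break
        kept.append(turn)
        used += n
    return list(reversed(kept))

turns = ["hello there", "how can I help you today", "my order never arrived"]
print(truncate_history(turns, max_tokens=10))
# ['how can I help you today', 'my order never arrived']
```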
Speak in Context demonstrates that a lightweight, contrastively aligned bridge between speech and language models can unlock multilingual, context‑aware ASR without the heavyweight cost of end‑to‑end retraining—an approach that many developers can adopt today to build smarter voice‑enabled products.
Authors
- Yuchen Zhang
- Haralambos Mouratidis
- Ravi Shekhar
Paper Information
- arXiv ID: 2603.06505v1
- Categories: cs.CL
- Published: March 6, 2026