[Paper] Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
Source: arXiv - 2603.06505v1
Overview
The paper presents Speak in Context, a multilingual automatic‑speech‑recognition (ASR) system that covers 11 languages and several English dialects while leveraging contextual cues such as dialogue history or domain‑specific keywords. By marrying a frozen pretrained speech encoder with a decoder‑only language model through a small projection layer and a contrastive‑learning alignment loss, the authors achieve a modular, plug‑and‑play architecture that consistently improves transcription quality, reducing word error rate by roughly 5 percentage points on a large, real‑world conversational dataset.
Key Contributions
- Multilingual, context‑aware ASR framework that works across 11 languages and 5 English dialects without retraining the backbone models.
- Modular design: a frozen speech encoder + decoder‑only LM + lightweight projection, preserving the benefits of large pretrained models while keeping compute low.
- Structured context prompts (dialogue turns, biasing words, etc.) that can be injected at inference time to steer the transcription.
- Contrastive learning objective that aligns speech embeddings with contextual embeddings in a shared space, providing a principled cross‑modal interaction.
- Extensive real‑world evaluation on >1,500 h of conversational speech, demonstrating consistent gains across languages and context types.
Methodology
Backbone components
- Speech encoder: a pretrained, frozen model (e.g., wav2vec‑2.0) that converts raw audio into a sequence of high‑dimensional embeddings.
- Language model: a decoder‑only transformer (e.g., GPT‑Neo) that generates token streams from textual prompts.
Projection module
- A small linear layer (plus optional layer‑norm) maps the speech encoder’s output into the LM’s embedding space, enabling the two modules to talk to each other without fine‑tuning either backbone.
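A minimal sketch of this projection step. The dimensions are illustrative (wav2vec 2.0 base emits 768‑d frames; the 2048‑d LM embedding space is an assumption), and the weights are random stand‑ins for the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 768-d speech-encoder frames, assumed 2048-d LM space.
SPEECH_DIM, LM_DIM = 768, 2048

# The projection is just a learned linear map plus optional layer norm;
# random weights here stand in for the trained parameters.
W = rng.standard_normal((SPEECH_DIM, LM_DIM)) * 0.02
b = np.zeros(LM_DIM)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def project(speech_embeddings):
    # speech_embeddings: (frames, SPEECH_DIM) -> (frames, LM_DIM)
    return layer_norm(speech_embeddings @ W + b)

frames = rng.standard_normal((50, SPEECH_DIM))  # one utterance, 50 frames
lm_ready = project(frames)
print(lm_ready.shape)  # (50, 2048)
```

Because only `W` and `b` (and the norm parameters) are learned, the trainable footprint is a few million parameters regardless of backbone size.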
Context representation
- Context is tokenized just like any other text (dialogue history, biasing words, task instructions) and fed to the LM as a prompt before the audio‑driven tokens.
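One plausible way to serialize such context into a single prompt string before tokenization; the tag names (`<task>`, `<history>`, `<bias>`) are illustrative, not the paper's exact template:

```python
def build_context_prompt(dialogue_history, biasing_words, instruction=None):
    """Serialize structured context into one text prompt that is tokenized
    like ordinary text and placed before the audio-driven tokens."""
    parts = []
    if instruction:
        parts.append(f"<task> {instruction}")
    for turn in dialogue_history:
        parts.append(f"<history> {turn}")
    if biasing_words:
        parts.append("<bias> " + ", ".join(biasing_words))
    return "\n".join(parts)

prompt = build_context_prompt(
    dialogue_history=["Agent: How can I help?", "Caller: My router is down."],
    biasing_words=["Archer AX55", "firmware"],
    instruction="Transcribe the caller's next utterance.",
)
print(prompt)
```

Because the context is plain text, new biasing lists or dialogue turns can be swapped in at inference time without touching any model weights.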
Contrastive alignment loss
- For each training step, the model samples a positive pair (speech embedding ↔ its true context) and several negative pairs (speech ↔ mismatched contexts).
- A contrastive loss (InfoNCE) pushes the positive pair’s cosine similarity higher while pulling negatives apart, shaping a joint embedding space where speech and its relevant context are close.
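The InfoNCE objective for one speech embedding against a batch of candidate contexts can be sketched as follows; the temperature value is a common default, not taken from the paper:

```python
import numpy as np

def info_nce(speech_emb, context_embs, positive_idx, temperature=0.07):
    """InfoNCE loss for one speech embedding vs. a batch of contexts.

    context_embs[positive_idx] is the matching context; all other rows act
    as negatives. Embeddings are L2-normalized so dot products are cosine
    similarities.
    """
    s = speech_emb / np.linalg.norm(speech_emb)
    C = context_embs / np.linalg.norm(context_embs, axis=1, keepdims=True)
    logits = C @ s / temperature        # scaled cosine similarities
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[positive_idx])

rng = np.random.default_rng(0)
ctx = rng.standard_normal((8, 256))               # 1 positive + 7 negatives
speech = ctx[3] + 0.1 * rng.standard_normal(256)  # speech close to context 3
loss_matched = info_nce(speech, ctx, positive_idx=3)
loss_mismatched = info_nce(speech, ctx, positive_idx=5)
print(loss_matched < loss_mismatched)  # True: the matching pair scores lower loss
```

Minimizing this loss raises the similarity of matched speech–context pairs relative to mismatched ones, which is exactly the shared-embedding-space behavior described above.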
Training regime
- The speech encoder and LM remain frozen; only the projection layer and the contrastive loss head are updated.
- Standard CTC or cross‑entropy ASR loss is still applied to the transcription output, so the system learns both accurate speech‑to‑text mapping and cross‑modal alignment.
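The combined objective can be sketched as a weighted sum of the ASR loss and the alignment loss; the mixing weight `lam` is an assumed hyperparameter, not reported in the paper:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    """Per-token cross-entropy over transcription logits (log-sum-exp form)."""
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]

def combined_loss(asr_logits, target_idx, contrastive_loss, lam=0.5):
    """ASR loss plus weighted contrastive alignment term."""
    return cross_entropy(asr_logits, target_idx) + lam * contrastive_loss

logits = np.array([2.0, 0.5, -1.0])  # toy logits over a 3-token vocabulary
loss = combined_loss(logits, target_idx=0, contrastive_loss=0.8)
print(round(float(loss), 3))  # 0.641
```

Only gradients flowing into the projection layer and the contrastive head are applied; the encoder's and LM's parameters are left untouched.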
Results & Findings
| Metric | Baseline (no context) | + Structured Context | + Contrastive Alignment |
|---|---|---|---|
| Average absolute WER reduction (11 languages) | — | ≈ 3 % | ≈ 5 % |
| Best‑case language WER (e.g., Mandarin) | 12.8 % | 10.2 % | 9.5 % |
| English dialects WER (avg. over 5) | 7.4 % | 5.9 % | 5.2 % |
- Context matters: Adding dialogue history or biasing words consistently lowered word‑error‑rate (WER) across all languages.
- Contrastive alignment adds value: The extra loss gave an additional ~2 % absolute WER gain on top of raw context prompts, confirming that a shared embedding space improves the model’s ability to “listen” for relevant cues.
- Modularity works: Freezing the large pretrained encoders saved GPU memory and training time (≈ 30 % less compute) while still achieving state‑of‑the‑art multilingual performance.
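For reference, WER (the metric throughout the table above) counts word-level substitutions, insertions, and deletions against the reference transcript, normalized by the reference length:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein edit distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

# One substitution ("the" -> "a") out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1666...
```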
Practical Implications
- Plug‑and‑play multilingual ASR: Companies can adopt the projection‑only fine‑tuning step to extend existing speech‑to‑text services to new languages or dialects without costly retraining of massive models.
- Dynamic biasing for domain‑specific vocabularies: By feeding biasing words (e.g., product names, medical terminology) as prompts, developers can improve recognition of rare or out‑of‑vocabulary terms on the fly.
- Conversational agents & call‑center analytics: The ability to ingest prior dialogue turns as context enables more accurate transcription in multi‑turn interactions, reducing downstream NLP errors.
- Resource‑efficient deployment: Since the heavy encoders stay frozen, inference can be split across devices (e.g., edge‑device speech encoder, cloud‑hosted LM), opening up low‑latency, privacy‑preserving deployments.
Limitations & Future Work
- Context length: The current prompt handling is limited by the LM’s maximum context window (≈ 2 k tokens), which may truncate very long conversations.
- Contrastive loss scaling: The alignment benefit plateaus after a certain number of negative samples; more sophisticated sampling or memory‑bank techniques could yield further gains.
- Language coverage: While 11 languages and several English dialects were tested, low‑resource languages with scarce pre‑trained encoders remain an open challenge.
- Real‑time latency: Adding the projection and contrastive alignment introduces a modest overhead; future work could explore quantization or distillation to meet strict real‑time requirements.
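The context‑length limitation above is commonly mitigated by keeping only the most recent dialogue turns that fit the token budget; a minimal sketch, using whitespace word counting as a stand‑in for the real tokenizer:

```python
def truncate_history(turns, max_tokens,
                     count_tokens=lambda t: len(t.split())):
    """Keep the most recent dialogue turns that fit the LM's context budget.

    Walks the history newest-first, accumulating turns until the budget is
    exhausted, then restores chronological order.
    """
    kept, used = [], 0
    for turn in reversed(turns):
        n = count_tokens(turn)
        if used + n > max_tokens:
            break
        kept.append(turn)
        used += n
    return list(reversed(kept))

turns = ["hello there", "how can I help you today", "my order never arrived"]
print(truncate_history(turns, max_tokens=10))
# ['how can I help you today', 'my order never arrived']
```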
Speak in Context demonstrates that a lightweight, contrastively aligned bridge between speech and language models can unlock multilingual, context‑aware ASR without the heavyweight cost of end‑to‑end retraining—an approach that many developers can adopt today to build smarter voice‑enabled products.
Authors
- Yuchen Zhang
- Haralambos Mouratidis
- Ravi Shekhar
Paper Information
- arXiv ID: 2603.06505v1
- Categories: cs.CL
- Published: March 6, 2026