[Paper] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Published: December 29, 2025 at 01:43 PM EST
4 min read

Source: arXiv - 2512.23686v1

Overview

Deepak Babu Piskala’s new paper introduces ProfASR‑Bench, a benchmark designed to evaluate automatic speech‑recognition (ASR) systems in high‑stakes professional domains such as finance, medicine, law, and tech. By pairing each audio clip with a short textual prompt that describes the speaker’s profile or the domain context, the benchmark makes it possible to measure how well modern ASR models actually use side‑information that is often available in real‑world deployments.

Key Contributions

  • Domain‑specific benchmark: ~10 k professionally‑styled utterances covering finance, medical, legal, and technology vocabularies, each annotated with entities (e.g., drug names, ticker symbols).
  • Context ladder: Four prompt levels – no‑context, profile only, domain + profile, and oracle (a prompt that exactly mirrors the target content) – plus an adversarial condition to probe robustness (see the sketch after this list).
  • Entity‑aware evaluation: In addition to classic WER, the suite reports entity error rate (EER) and confidence‑interval‑backed slice metrics (accent, gender).
  • Reference implementations: Baselines using Whisper (encoder‑decoder ASR) and Qwen‑Omni (audio‑language model) across all prompt conditions.
  • Open‑source release: Dataset on Hugging Face and evaluation code on GitHub, enabling reproducible comparisons of context‑fusion strategies.
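To make the context ladder concrete, the sketch below runs a single utterance through Whisper under each prompt level (plus an adversarial cue) via the `initial_prompt` argument of `whisper.transcribe`. The field names (`audio_path`, `profile`, `domain`, `reference`) and the prompt templates are illustrative placeholders, not the benchmark's actual schema or wording.

```python
# Minimal sketch: one utterance under each prompt condition via Whisper.
# Field names and templates are illustrative placeholders; consult the released
# dataset and evaluation code for the actual schema and prompts.
import whisper

model = whisper.load_model("small")  # any Whisper checkpoint works for the sketch

def build_prompts(example):
    """Assemble the four prompt levels plus an adversarial cue."""
    return {
        "no_context": None,
        "profile_only": example["profile"],                        # e.g. "You are a cardiologist."
        "domain_profile": f'{example["domain"]}. {example["profile"]}',
        "oracle": example["reference"],                            # prompt mirrors the target text
        "adversarial": "You are a sports commentator describing a football match.",
    }

def transcribe_all_conditions(example):
    hyps = {}
    for condition, prompt in build_prompts(example).items():
        result = model.transcribe(example["audio_path"], initial_prompt=prompt)
        hyps[condition] = result["text"].strip()
    return hyps
```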

Methodology

  1. Data collection – Professional speakers read scripts that embed dense, domain‑specific terminology. Recordings are balanced across accents, genders, and speaking styles.
  2. Prompt design – For each utterance, a short natural‑language cue is generated (e.g., “You are a cardiologist discussing patient‑specific medication”). The cue can be omitted, partially provided, or replaced with an “oracle” version that exactly mirrors the target content.
  3. Model evaluation – Two representative ASR families are run under each prompt condition. The outputs are scored with:
    • WER – overall transcription accuracy.
    • EER – proportion of critical entities (tickers, drug codes, legal citations) that are mis‑recognized.
    • Slice metrics – WER/EER broken down by speaker accent and gender, with bootstrap confidence intervals.
  4. Analysis of the “context‑utilization gap” – The authors compare performance across prompt levels to quantify how much of the available side information the models actually exploit (a minimal scoring sketch follows this list).
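A minimal sketch of the scoring in step 3 and the gap analysis in step 4, under assumed metric definitions: WER from the `jiwer` package, entity error rate approximated as the fraction of annotated entities missing from the hypothesis (the paper's matching rules may be stricter, e.g. with normalization), and the context‑utilization gap taken as the no‑context‑minus‑oracle difference.

```python
# Assumed metric definitions; the paper's exact entity-matching rules may differ.
import jiwer

def entity_error_rate(hypotheses, references_entities):
    """Fraction of annotated entities that never appear in the corresponding hypothesis."""
    missed, total = 0, 0
    for hyp, entities in zip(hypotheses, references_entities):
        hyp_lower = hyp.lower()
        for entity in entities:              # e.g. ["apixaban", "NASDAQ:TSLA"]
            total += 1
            if entity.lower() not in hyp_lower:
                missed += 1
    return missed / max(total, 1)

def score_condition(hypotheses, references, references_entities):
    return {
        "wer": jiwer.wer(references, hypotheses),
        "eer": entity_error_rate(hypotheses, references_entities),
    }

def context_utilization_gap(scores_no_context, scores_oracle, metric="wer"):
    """How much of the available side information the model converts into accuracy."""
    return scores_no_context[metric] - scores_oracle[metric]
```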

Results & Findings

| Prompt condition | Whisper WER ↓ | Qwen‑Omni WER ↓ | Entity Error Rate (EER) |
|---|---|---|---|
| No‑context | 12.4 % | 10.8 % | 7.9 % |
| Profile only | 12.2 % | 10.7 % | 7.7 % |
| Domain + profile | 12.1 % | 10.6 % | 7.6 % |
| Oracle | 11.9 % | 10.5 % | 7.5 % |
| Adversarial | 12.5 % | 11.0 % | 8.2 % |
  • Minimal impact of prompts – Even the perfect oracle prompt improves average WER by less than 0.5 % absolute, and EER drops only marginally.
  • Adversarial prompts are not catastrophic – Injecting misleading context does not consistently degrade performance, suggesting current models ignore the prompt rather than being misled.
  • Consistent “context‑utilization gap” (CUG) – Across both model families, the gap between no‑context and oracle performance is tiny, indicating that the architectures are nominally promptable but rarely leverage the extra information.

Slice‑wise analysis reveals slightly higher errors for non‑native accents, but the CUG remains uniform across these slices.
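The slice‑level confidence intervals can be approximated with a standard nonparametric bootstrap over utterances; the sketch below is a generic version of that procedure, with the resample count and 95 % percentile interval as conventional choices rather than the paper's exact protocol.

```python
# Generic nonparametric bootstrap over utterances for a slice-level WER estimate.
# Resample count and percentile interval are conventional choices, not the paper's protocol.
import numpy as np
import jiwer

def bootstrap_wer_ci(references, hypotheses, n_resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(references)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # sample utterances with replacement
        refs = [references[i] for i in idx]
        hyps = [hypotheses[i] for i in idx]
        stats.append(jiwer.wer(refs, hyps))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return jiwer.wer(references, hypotheses), (lo, hi)

# Usage: restrict references/hypotheses to one slice, e.g. non-native-accent utterances:
# wer, (lo, hi) = bootstrap_wer_ci(refs_nonnative, hyps_nonnative)
```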

Practical Implications

  • Deployments can’t rely on simple prompts – Adding a short “speaker profile” or “domain cue” to an API call will not meaningfully boost transcription quality for critical entities.
  • Need for explicit fusion mechanisms – Engineers building ASR pipelines for finance or healthcare should consider tighter integration of domain knowledge (e.g., biasing decoding with custom vocabularies or shallow fusion with a domain LM; see the rescoring sketch after this list) rather than just passing a textual prompt.
  • Benchmark as a testing harness – ProfASR‑Bench gives product teams a ready‑made suite to stress‑test their models for entity fidelity, a key compliance requirement in regulated sectors.
  • Confidence‑interval reporting – The slice‑aware metrics help quantify risk for specific user groups (e.g., non‑native speakers), supporting more transparent SLA definitions.
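To illustrate what a tighter fusion mechanism could look like in practice, the sketch below rescores an ASR system's n‑best hypotheses with a domain language model (simple log‑linear shallow fusion). Both `domain_lm_logprob` and the fusion weight are placeholders: any in‑domain LM, or even a crude domain‑term bonus as shown, could stand in.

```python
# Sketch of n-best rescoring with a domain LM (log-linear shallow fusion).
# `asr_nbest` is assumed to be a list of (hypothesis_text, asr_logprob) pairs;
# `domain_lm_logprob` is a placeholder for any in-domain language model scorer.

def rescore_with_domain_lm(asr_nbest, domain_lm_logprob, lm_weight=0.3):
    """Pick the hypothesis maximizing asr_logprob + lm_weight * domain_lm_logprob(hyp)."""
    best_hyp, best_score = None, float("-inf")
    for hyp, asr_lp in asr_nbest:
        score = asr_lp + lm_weight * domain_lm_logprob(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# A crude stand-in "LM" that rewards known domain terms (placeholder for a real LM).
def make_term_bonus_lm(domain_terms, bonus=2.0):
    terms = {t.lower() for t in domain_terms}
    return lambda hyp: bonus * sum(w in terms for w in hyp.lower().split())
```

In a real pipeline the term‑bonus function would be replaced by an actual in‑domain LM or a per‑customer contextual‑biasing list, but the scoring structure stays the same.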

In short, the paper warns that “promptable” ASR is still largely a buzzword: real‑world gains in high‑stakes domains demand architectural changes to how context is fused, not merely richer prompts.

Limitations & Future Work

  • Scope of domains – Only four professional sectors are covered; other high‑stakes fields (e.g., aviation, defense) remain untested.
  • Prompt richness – The prompts are short and templated; richer contextual cues (full meeting minutes, knowledge‑graph embeddings) might show larger effects.
  • Model diversity – Baselines are limited to Whisper and Qwen‑Omni; newer multimodal or retrieval‑augmented ASR systems could behave differently.
  • Adversarial design – The adversarial prompts are synthetically generated and may not capture sophisticated real‑world misinformation attacks.

Suggested future work includes extending the benchmark to multilingual professional speech, exploring retrieval‑augmented decoding, and measuring downstream task impact (e.g., automated compliance checking).

Authors

  • Deepak Babu Piskala

Paper Information

  • arXiv ID: 2512.23686v1
  • Categories: cs.CL, cs.SD
  • Published: December 29, 2025