[Paper] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Published: December 29, 2025 at 01:43 PM EST
4 min read

Source: arXiv - 2512.23686v1

Overview

Deepak Babu Piskala’s new paper introduces ProfASR‑Bench, a benchmark designed to evaluate automatic speech‑recognition (ASR) systems in high‑stakes professional domains such as finance, medicine, law, and tech. By pairing each audio clip with a short textual prompt that describes the speaker’s profile or the domain context, the benchmark makes it possible to measure how well modern ASR models actually use side‑information that is often available in real‑world deployments.

Key Contributions

  • Domain‑specific benchmark: ~10 k professionally‑styled utterances covering finance, medical, legal, and technology vocabularies, each annotated with entities (e.g., drug names, ticker symbols).
  • Context ladder: Four prompt levels – no‑context, profile only, domain + profile, and oracle (a prompt that exactly mirrors the target content) – plus an adversarial condition to probe robustness (see the sketch after this list).
  • Entity‑aware evaluation: In addition to classic WER, the suite reports entity error rate (EER) and confidence‑interval‑backed slice metrics (accent, gender).
  • Reference implementations: Baselines using Whisper (encoder‑decoder ASR) and Qwen‑Omni (audio‑language model) across all prompt conditions.
  • Open‑source release: Dataset on Hugging Face and evaluation code on GitHub, enabling reproducible comparisons of context‑fusion strategies.
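To make the context ladder concrete, the sketch below runs a single utterance through Whisper under each prompt level (plus an adversarial cue) via the `initial_prompt` argument of `whisper.transcribe`. The field names (`audio_path`, `profile`, `domain`, `reference`) and the prompt templates are illustrative placeholders, not the benchmark's actual schema or wording.

```python
# Minimal sketch: one utterance under each prompt condition via Whisper.
# Field names and templates are illustrative placeholders; consult the released
# dataset and evaluation code for the actual schema and prompts.
import whisper

model = whisper.load_model("small")  # any Whisper checkpoint works for the sketch

def build_prompts(example):
    """Assemble the four prompt levels plus an adversarial cue."""
    return {
        "no_context": None,
        "profile_only": example["profile"],                        # e.g. "You are a cardiologist."
        "domain_profile": f'{example["domain"]}. {example["profile"]}',
        "oracle": example["reference"],                            # prompt mirrors the target text
        "adversarial": "You are a sports commentator describing a football match.",
    }

def transcribe_all_conditions(example):
    hyps = {}
    for condition, prompt in build_prompts(example).items():
        result = model.transcribe(example["audio_path"], initial_prompt=prompt)
        hyps[condition] = result["text"].strip()
    return hyps
```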

Methodology

  1. Data collection – Professional speakers read scripts that embed dense, domain‑specific terminology. Recordings are balanced across accents, genders, and speaking styles.
  2. Prompt design – For each utterance, a short natural‑language cue is generated (e.g., “You are a cardiologist discussing patient‑specific medication”). The cue can be omitted, partially provided, or replaced with an “oracle” version that exactly mirrors the target content.
  3. Model evaluation – Two representative ASR families are run under each prompt condition. The outputs are scored with:
    • WER – overall transcription accuracy.
    • EER – proportion of critical entities (tickers, drug codes, legal citations) that are mis‑recognized.
    • Slice metrics – WER/EER broken down by speaker accent and gender, with bootstrap confidence intervals.
  4. Analysis of the “context‑utilization gap” – The authors compare performance across prompt levels to quantify how much of the available side information the models actually exploit (a minimal scoring sketch follows this list).
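A minimal sketch of the scoring in step 3 and the gap analysis in step 4, under assumed metric definitions: WER from the `jiwer` package, entity error rate approximated as the fraction of annotated entities missing from the hypothesis (the paper's matching rules may be stricter, e.g. with normalization), and the context‑utilization gap taken as the no‑context‑minus‑oracle difference.

```python
# Assumed metric definitions; the paper's exact entity-matching rules may differ.
import jiwer

def entity_error_rate(hypotheses, references_entities):
    """Fraction of annotated entities that never appear in the corresponding hypothesis."""
    missed, total = 0, 0
    for hyp, entities in zip(hypotheses, references_entities):
        hyp_lower = hyp.lower()
        for entity in entities:              # e.g. ["apixaban", "NASDAQ:TSLA"]
            total += 1
            if entity.lower() not in hyp_lower:
                missed += 1
    return missed / max(total, 1)

def score_condition(hypotheses, references, references_entities):
    return {
        "wer": jiwer.wer(references, hypotheses),
        "eer": entity_error_rate(hypotheses, references_entities),
    }

def context_utilization_gap(scores_no_context, scores_oracle, metric="wer"):
    """How much of the available side information the model converts into accuracy."""
    return scores_no_context[metric] - scores_oracle[metric]
```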

Results & Findings

| Prompt condition | Whisper WER ↓ | Qwen‑Omni WER ↓ | Entity Error Rate (EER) |
|---|---|---|---|
| No‑context | 12.4 % | 10.8 % | 7.9 % |
| Profile only | 12.2 % | 10.7 % | 7.7 % |
| Domain + profile | 12.1 % | 10.6 % | 7.6 % |
| Oracle | 11.9 % | 10.5 % | 7.5 % |
| Adversarial | 12.5 % | 11.0 % | 8.2 % |
  • Minimal impact of prompts – Even the perfect oracle prompt improves average WER by less than 0.5 % absolute, and EER drops only marginally.
  • Adversarial prompts are not catastrophic – Injecting misleading context does not consistently degrade performance, suggesting current models ignore the prompt rather than being misled.
  • Consistent “context‑utilization gap” (CUG) – Across both model families, the gap between no‑context and oracle performance is tiny, indicating that the architectures are nominally promptable but rarely leverage the extra information.

Slice‑wise analysis reveals slightly higher errors for non‑native accents, but the CUG remains uniform across these slices.
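The slice‑level confidence intervals can be approximated with a standard nonparametric bootstrap over utterances; the sketch below is a generic version of that procedure, with the resample count and 95 % percentile interval as conventional choices rather than the paper's exact protocol.

```python
# Generic nonparametric bootstrap over utterances for a slice-level WER estimate.
# Resample count and percentile interval are conventional choices, not the paper's protocol.
import numpy as np
import jiwer

def bootstrap_wer_ci(references, hypotheses, n_resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(references)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # sample utterances with replacement
        refs = [references[i] for i in idx]
        hyps = [hypotheses[i] for i in idx]
        stats.append(jiwer.wer(refs, hyps))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return jiwer.wer(references, hypotheses), (lo, hi)

# Usage: restrict references/hypotheses to one slice, e.g. non-native-accent utterances:
# wer, (lo, hi) = bootstrap_wer_ci(refs_nonnative, hyps_nonnative)
```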

Practical Implications

  • Deployments can’t rely on simple prompts – Adding a short “speaker profile” or “domain cue” to an API call will not meaningfully boost transcription quality for critical entities.
  • Need for explicit fusion mechanisms – Engineers building ASR pipelines for finance or healthcare should consider tighter integration of domain knowledge (e.g., biasing decoding with custom vocabularies or shallow fusion with a domain LM; see the rescoring sketch after this list) rather than just passing a textual prompt.
  • Benchmark as a testing harness – ProfASR‑Bench gives product teams a ready‑made suite to stress‑test their models for entity fidelity, a key compliance requirement in regulated sectors.
  • Confidence‑interval reporting – The slice‑aware metrics help quantify risk for specific user groups (e.g., non‑native speakers), supporting more transparent SLA definitions.
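To illustrate what a tighter fusion mechanism could look like in practice, the sketch below rescores an ASR system's n‑best hypotheses with a domain language model (simple log‑linear shallow fusion). Both `domain_lm_logprob` and the fusion weight are placeholders: any in‑domain LM, or even a crude domain‑term bonus as shown, could stand in.

```python
# Sketch of n-best rescoring with a domain LM (log-linear shallow fusion).
# `asr_nbest` is assumed to be a list of (hypothesis_text, asr_logprob) pairs;
# `domain_lm_logprob` is a placeholder for any in-domain language model scorer.

def rescore_with_domain_lm(asr_nbest, domain_lm_logprob, lm_weight=0.3):
    """Pick the hypothesis maximizing asr_logprob + lm_weight * domain_lm_logprob(hyp)."""
    best_hyp, best_score = None, float("-inf")
    for hyp, asr_lp in asr_nbest:
        score = asr_lp + lm_weight * domain_lm_logprob(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# A crude stand-in "LM" that rewards known domain terms (placeholder for a real LM).
def make_term_bonus_lm(domain_terms, bonus=2.0):
    terms = {t.lower() for t in domain_terms}
    return lambda hyp: bonus * sum(w in terms for w in hyp.lower().split())
```

In a real pipeline the term‑bonus function would be replaced by an actual in‑domain LM or a per‑customer contextual‑biasing list, but the scoring structure stays the same.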

In short, the paper warns that “promptable” ASR is still largely a buzzword: real‑world gains in high‑stakes domains demand architectural changes to how context is fused, not merely richer prompts.

Limitations & Future Work

  • Scope of domains – Only four professional sectors are covered; other high‑stakes fields (e.g., aviation, defense) remain untested.
  • Prompt richness – The prompts are short and templated; richer contextual cues (full meeting minutes, knowledge‑graph embeddings) might show larger effects.
  • Model diversity – Baselines are limited to Whisper and Qwen‑Omni; newer multimodal or retrieval‑augmented ASR systems could behave differently.
  • Adversarial design – The adversarial prompts are synthetically generated and may not capture sophisticated real‑world misinformation attacks.

Suggested future work includes extending the benchmark to multilingual professional speech, exploring retrieval‑augmented decoding, and measuring downstream task impact (e.g., automated compliance checking).

Authors

  • Deepak Babu Piskala

Paper Information

  • arXiv ID: 2512.23686v1
  • Categories: cs.CL, cs.SD
  • Published: December 29, 2025