[Paper] Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Published: April 23, 2026 at 01:59 PM EDT
4 min read
Source: arXiv - 2604.21928v1

Overview

The paper Evaluation of Automatic Speech Recognition Using Generative Large Language Models investigates whether modern generative LLMs can serve as smarter judges of ASR quality—going beyond the traditional Word Error Rate (WER) that only counts literal mismatches. By testing three LLM‑driven strategies on the HATS benchmark, the authors show that decoder‑based LLMs can align with human judgments far better than WER or existing semantic metrics, opening the door to more meaningful, interpretable ASR evaluation.
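For reference, WER is the word-level Levenshtein edit distance between hypothesis and reference, normalized by reference length; it counts substitutions, deletions, and insertions with no notion of meaning, which is exactly the limitation the paper targets. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that `wer("I read the news", "I red the news")` and a hypothesis that changes the sentence's entire meaning can score identically, since every word mismatch counts the same.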

Key Contributions

  • LLM‑based hypothesis selection: Demonstrated that a generative LLM can pick the more accurate transcription between two ASR outputs with ≈ 92–94 % agreement with human annotators, versus ≈ 63 % for WER.
  • Semantic distance via decoder embeddings: Showed that embeddings extracted from the decoder side of large generative models capture meaning as well as (and sometimes better than) dedicated encoder‑only sentence‑embedding models.
  • Error‑type classification: Proposed a qualitative framework where LLMs label ASR errors (e.g., lexical, syntactic, semantic) to provide interpretable feedback.
  • Benchmarking on HATS: Provided the first large‑scale, human‑aligned evaluation of LLM‑driven ASR metrics on a realistic speech‑to‑text dataset.

Methodology

  1. Data & Task: The authors used the HATS dataset, which contains audio clips, two competing ASR hypotheses per clip, and human annotations indicating which hypothesis is “better.”
  2. Three LLM‑centric approaches:
    • Hypothesis selection: Feed both transcriptions (and optionally the reference transcript) to a generative LLM and ask it to choose the more accurate one.
    • Semantic distance: Encode each hypothesis with the LLM’s decoder hidden states, compute cosine similarity, and treat the lower‑distance pair as the better match to the reference.
    • Error classification: Prompt the LLM to label the type of mistake (e.g., missing word, wrong tense, semantic drift), producing a human‑readable error report.
  3. Baselines: Classic WER, recent embedding‑based semantic similarity metrics (e.g., Sentence‑BERT), and a few smaller LLMs for comparison.
  4. Evaluation: Agreement with human annotators (percentage of correct selections) and correlation with human‑rated error severity.
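The paper's exact prompts are not reproduced here, but the hypothesis-selection step has roughly the following shape. The prompt wording and the helper names (`build_selection_prompt`, `parse_choice`) are illustrative assumptions; the actual LLM call is left abstract:

```python
def build_selection_prompt(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Ask the model to judge which ASR hypothesis better matches the reference."""
    return (
        "You are evaluating automatic speech recognition output.\n"
        f"Reference: {reference}\n"
        f"Hypothesis A: {hyp_a}\n"
        f"Hypothesis B: {hyp_b}\n"
        "Which hypothesis is the more accurate transcription? Answer with A or B only."
    )

def parse_choice(llm_answer: str) -> str:
    """Extract 'A' or 'B' from a (possibly verbose) model reply."""
    answer = llm_answer.strip().upper()
    if answer.startswith("A"):
        return "A"
    if answer.startswith("B"):
        return "B"
    raise ValueError(f"Unparseable judgment: {llm_answer!r}")
```

Agreement with human annotators is then just the fraction of clips where `parse_choice` matches the human-preferred hypothesis.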

The pipeline is deliberately lightweight: it only requires sending text prompts to an off‑the‑shelf LLM (e.g., GPT‑3.5‑Turbo, LLaMA‑2‑70B) and extracting the final hidden layer for similarity scoring.
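The similarity-scoring step can be sketched in plain Python. In practice the per-token vectors would be decoder hidden states pulled from a model such as LLaMA-2 (e.g. via `output_hidden_states=True` in Hugging Face transformers); here they are stand-in lists of floats:

```python
import math

def mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Average the per-token decoder hidden states into one sentence vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]

def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; a lower value means the texts are semantically closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm
```

The hypothesis whose pooled embedding has the smaller `cosine_distance` to the reference embedding is scored as the better match.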

Results & Findings

| Metric | Human Agreement (Selection) | Correlation with Human Error Scores |
| --- | --- | --- |
| WER (baseline) | 63 % | 0.42 |
| Sentence‑BERT similarity | 78 % | 0.58 |
| Top‑performing LLM (GPT‑4‑Turbo) | 92–94 % | 0.81 |
| Decoder embeddings (LLaMA‑2‑70B) | 89 % | 0.77 |

  • Selection task: The best LLMs outperformed all baselines by a large margin, nearly matching human consensus.
  • Embedding similarity: Decoder‑side embeddings were on par with dedicated encoder models, confirming that generative LLMs retain rich semantic information.
  • Error classification: LLMs could correctly label the dominant error type in >85 % of cases, offering a readable diagnostic that WER cannot provide.

Overall, the study demonstrates that LLMs can serve as both a quantitative scorer and a qualitative analyst for ASR outputs.
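The qualitative side can be sketched the same way as the selection prompt. The label set and wording below are illustrative assumptions, not the paper's exact taxonomy:

```python
# Hypothetical closed label set for the dominant-error classification prompt.
ERROR_TYPES = ["lexical", "syntactic", "semantic", "none"]

def build_error_prompt(reference: str, hypothesis: str) -> str:
    """Prompt the model to name the dominant error category."""
    return (
        "Compare the ASR hypothesis to the reference transcript.\n"
        f"Reference: {reference}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Label the dominant error as one of: {', '.join(ERROR_TYPES)}."
    )

def parse_error_label(llm_answer: str) -> str:
    """Map a free-form model reply onto the closed label set."""
    reply = llm_answer.lower()
    for label in ERROR_TYPES:
        if label in reply:
            return label
    return "unknown"
```

Constraining the reply to a fixed label set is what makes the diagnostic aggregatable: the per-label counts across a test set become the "readable error report" that WER cannot provide.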

Practical Implications

  • More meaningful ASR benchmarking: Companies can replace or augment WER with LLM‑based scores that reflect user‑perceived quality, leading to product improvements that matter to end‑users.
  • Automated error diagnostics: Development pipelines can integrate the error‑classification prompt to surface systematic failure modes (e.g., domain‑specific terminology, homophones) without manual inspection.
  • Rapid model iteration: Since the approach only needs text prompts, it can be applied to any ASR system regardless of architecture, enabling quick “A/B” testing of new acoustic or language models.
  • Cross‑language potential: Generative LLMs already support many languages; the same evaluation framework could be extended to multilingual ASR without building language‑specific metrics.
  • Cost‑effective evaluation: Leveraging hosted LLM APIs can be cheaper than large‑scale human annotation, especially for continuous integration testing.

Limitations & Future Work

  • Dependency on LLM size & API access: The highest agreement scores came from the largest commercial models; smaller open‑source LLMs lag behind, which may limit reproducibility for budget‑constrained teams.
  • Prompt sensitivity: Results vary with prompt phrasing; a systematic study of prompt engineering for evaluation is still needed.
  • Domain bias: The HATS dataset is relatively clean; performance on noisy, code‑mixed, or highly technical speech remains untested.
  • Interpretability of embeddings: While decoder embeddings work well, the paper does not dissect which layers or attention heads contribute most to semantic alignment.

Future research directions include scaling the approach to real‑time ASR monitoring, exploring few‑shot fine‑tuning of LLMs for domain‑specific evaluation, and integrating multimodal cues (e.g., audio embeddings) to further close the gap between automatic metrics and human perception.

Authors

  • Thibault Bañeras-Roux
  • Shashi Kumar
  • Driss Khalil
  • Sergio Burdisso
  • Petr Motlicek
  • Shiran Liu
  • Mickael Rouvier
  • Jane Wottawa
  • Richard Dufour

Paper Information

  • arXiv ID: 2604.21928v1
  • Categories: cs.CL
  • Published: April 23, 2026
  • PDF: Download PDF