[Paper] Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Published: April 23, 2026 at 01:59 PM EDT
4 min read
Source: arXiv - 2604.21928v1

Overview

The paper Evaluation of Automatic Speech Recognition Using Generative Large Language Models investigates whether modern generative LLMs can serve as smarter judges of ASR quality—going beyond the traditional Word Error Rate (WER) that only counts literal mismatches. By testing three LLM‑driven strategies on the HATS benchmark, the authors show that decoder‑based LLMs can align with human judgments far better than WER or existing semantic metrics, opening the door to more meaningful, interpretable ASR evaluation.
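For reference, WER is the word-level Levenshtein edit distance between hypothesis and reference, normalized by reference length; it counts substitutions, deletions, and insertions with no notion of meaning, which is exactly the limitation the paper targets. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that `wer("I read the news", "I red the news")` and a hypothesis that changes the sentence's entire meaning can score identically, since every word mismatch counts the same.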

Key Contributions

  • LLM‑based hypothesis selection: Demonstrated that a generative LLM can pick the more accurate transcription between two ASR outputs with ≈ 92–94 % agreement with human annotators, versus ≈ 63 % for WER.
  • Semantic distance via decoder embeddings: Showed that embeddings extracted from the decoder side of large generative models capture meaning as well as (and sometimes better than) dedicated encoder‑only sentence‑embedding models.
  • Error‑type classification: Proposed a qualitative framework where LLMs label ASR errors (e.g., lexical, syntactic, semantic) to provide interpretable feedback.
  • Benchmarking on HATS: Provided the first large‑scale, human‑aligned evaluation of LLM‑driven ASR metrics on a realistic speech‑to‑text dataset.

Methodology

  1. Data & Task: The authors used the HATS dataset, which contains audio clips, two competing ASR hypotheses per clip, and human annotations indicating which hypothesis is “better.”
  2. Three LLM‑centric approaches:
    • Hypothesis selection: Feed both transcriptions (and optionally the reference transcript) to a generative LLM and ask it to choose the more accurate one.
    • Semantic distance: Encode each hypothesis with the LLM’s decoder hidden states, compute cosine similarity, and treat the lower‑distance pair as the better match to the reference.
    • Error classification: Prompt the LLM to label the type of mistake (e.g., missing word, wrong tense, semantic drift), producing a human‑readable error report.
  3. Baselines: Classic WER, recent embedding‑based semantic similarity metrics (e.g., Sentence‑BERT), and a few smaller LLMs for comparison.
  4. Evaluation: Agreement with human annotators (percentage of correct selections) and correlation with human‑rated error severity.
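The paper's exact prompts are not reproduced here, but the hypothesis-selection step has roughly the following shape. The prompt wording and the helper names (`build_selection_prompt`, `parse_choice`) are illustrative assumptions; the actual LLM call is left abstract:

```python
def build_selection_prompt(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Ask the model to judge which ASR hypothesis better matches the reference."""
    return (
        "You are evaluating automatic speech recognition output.\n"
        f"Reference: {reference}\n"
        f"Hypothesis A: {hyp_a}\n"
        f"Hypothesis B: {hyp_b}\n"
        "Which hypothesis is the more accurate transcription? Answer with A or B only."
    )

def parse_choice(llm_answer: str) -> str:
    """Extract 'A' or 'B' from a (possibly verbose) model reply."""
    answer = llm_answer.strip().upper()
    if answer.startswith("A"):
        return "A"
    if answer.startswith("B"):
        return "B"
    raise ValueError(f"Unparseable judgment: {llm_answer!r}")
```

Agreement with human annotators is then just the fraction of clips where `parse_choice` matches the human-preferred hypothesis.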

The pipeline is deliberately lightweight: it only requires sending text prompts to an off‑the‑shelf LLM (e.g., GPT‑3.5‑Turbo, LLaMA‑2‑70B) and extracting the final hidden layer for similarity scoring.
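The similarity-scoring step can be sketched in plain Python. In practice the per-token vectors would be decoder hidden states pulled from a model such as LLaMA-2 (e.g. via `output_hidden_states=True` in Hugging Face transformers); here they are stand-in lists of floats:

```python
import math

def mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Average the per-token decoder hidden states into one sentence vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]

def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; a lower value means the texts are semantically closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm
```

The hypothesis whose pooled embedding has the smaller `cosine_distance` to the reference embedding is scored as the better match.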

Results & Findings

| Metric | Human Agreement (Selection) | Correlation with Human Error Scores |
| --- | --- | --- |
| WER (baseline) | 63 % | 0.42 |
| Sentence‑BERT similarity | 78 % | 0.58 |
| Top‑performing LLM (GPT‑4‑Turbo) | 92–94 % | 0.81 |
| Decoder embeddings (LLaMA‑2‑70B) | 89 % | 0.77 |

  • Selection task: The best LLMs outperformed all baselines by a large margin, nearly matching human consensus.
  • Embedding similarity: Decoder‑side embeddings were on par with dedicated encoder models, confirming that generative LLMs retain rich semantic information.
  • Error classification: LLMs could correctly label the dominant error type in >85 % of cases, offering a readable diagnostic that WER cannot provide.

Overall, the study demonstrates that LLMs can serve as both a quantitative scorer and a qualitative analyst for ASR outputs.
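The qualitative side can be sketched the same way as the selection prompt. The label set and wording below are illustrative assumptions, not the paper's exact taxonomy:

```python
# Hypothetical closed label set for the dominant-error classification prompt.
ERROR_TYPES = ["lexical", "syntactic", "semantic", "none"]

def build_error_prompt(reference: str, hypothesis: str) -> str:
    """Prompt the model to name the dominant error category."""
    return (
        "Compare the ASR hypothesis to the reference transcript.\n"
        f"Reference: {reference}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Label the dominant error as one of: {', '.join(ERROR_TYPES)}."
    )

def parse_error_label(llm_answer: str) -> str:
    """Map a free-form model reply onto the closed label set."""
    reply = llm_answer.lower()
    for label in ERROR_TYPES:
        if label in reply:
            return label
    return "unknown"
```

Constraining the reply to a fixed label set is what makes the diagnostic aggregatable: the per-label counts across a test set become the "readable error report" that WER cannot provide.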

Practical Implications

  • More meaningful ASR benchmarking: Companies can replace or augment WER with LLM‑based scores that reflect user‑perceived quality, leading to product improvements that matter to end‑users.
  • Automated error diagnostics: Development pipelines can integrate the error‑classification prompt to surface systematic failure modes (e.g., domain‑specific terminology, homophones) without manual inspection.
  • Rapid model iteration: Since the approach only needs text prompts, it can be applied to any ASR system regardless of architecture, enabling quick “A/B” testing of new acoustic or language models.
  • Cross‑language potential: Generative LLMs already support many languages; the same evaluation framework could be extended to multilingual ASR without building language‑specific metrics.
  • Cost‑effective evaluation: Leveraging hosted LLM APIs can be cheaper than large‑scale human annotation, especially for continuous integration testing.

Limitations & Future Work

  • Dependency on LLM size & API access: The highest agreement scores came from the largest commercial models; smaller open‑source LLMs lag behind, which may limit reproducibility for budget‑constrained teams.
  • Prompt sensitivity: Results vary with prompt phrasing; a systematic study of prompt engineering for evaluation is still needed.
  • Domain bias: The HATS dataset is relatively clean; performance on noisy, code‑mixed, or highly technical speech remains untested.
  • Interpretability of embeddings: While decoder embeddings work well, the paper does not dissect which layers or attention heads contribute most to semantic alignment.

Future research directions include scaling the approach to real‑time ASR monitoring, exploring few‑shot fine‑tuning of LLMs for domain‑specific evaluation, and integrating multimodal cues (e.g., audio embeddings) to further close the gap between automatic metrics and human perception.

Authors

  • Thibault Bañeras-Roux
  • Shashi Kumar
  • Driss Khalil
  • Sergio Burdisso
  • Petr Motlicek
  • Shiran Liu
  • Mickael Rouvier
  • Jane Wottawa
  • Richard Dufour

Paper Information

  • arXiv ID: 2604.21928v1
  • Categories: cs.CL
  • Published: April 23, 2026
  • PDF: Download PDF