[Paper] Spoken Conversational Agents with Large Language Models

Published: December 2, 2025 at 05:02 AM EST
4 min read
Source: arXiv - 2512.02593v1

Overview

The paper “Spoken Conversational Agents with Large Language Models” maps the rapid shift from traditional cascaded speech‑recognition‑plus‑NLU pipelines to modern, voice‑native large language model (LLM) architectures. By dissecting both research and production‑grade systems, the authors give developers a concrete roadmap for building, evaluating, and deploying next‑generation spoken assistants that can understand and generate language directly from audio.

Key Contributions

  • Unified taxonomy of spoken‑agent architectures: cascaded ASR → NLU, end‑to‑end (E2E) speech‑LLM, and hybrid retrieval‑plus‑vision‑grounded models.
  • Cross‑modal adaptation strategies for turning text‑only LLMs into audio‑aware models (e.g., audio tokenizers, speech‑text alignment, joint pre‑training).
  • Comprehensive benchmark suite covering datasets (LibriSpeech, VoxPopuli, SLURP, etc.), metrics (WER, SER, BLEU, safety scores), and robustness tests across accents, noise, and code‑switching.
  • Design‑space analysis comparing cascaded vs. E2E pipelines, post‑ASR correction layers, and streaming inference latency.
  • Reproducible baselines (open‑source recipes on Hugging Face, ESPnet, and Kaldi) that bridge academic prototypes and industrial deployments.
  • Roadmap of open challenges in privacy‑preserving on‑device inference, safety/guardrails for LLM‑driven speech, and evaluation standards for open‑domain spoken dialogue.

Methodology

Model Families

  • Cascaded: Conventional ASR (CTC/Transducer) → text‑LLM (e.g., GPT‑3); a minimal pipeline sketch follows this list.
  • End‑to‑End: Direct speech‑to‑text‑LLM models that ingest acoustic frames and output token sequences using a unified transformer encoder‑decoder.
  • Hybrid Retrieval‑Vision: Speech encoder + multimodal retriever (e.g., CLIP) + LLM that can ground responses on images or external knowledge bases.
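
To make the cascaded family concrete, here is a minimal sketch of an ASR → text‑LLM pipeline built with Hugging Face transformers. The checkpoints and the audio file name are illustrative placeholders, not the models or data used in the paper.

```python
# Minimal cascaded spoken-agent sketch: ASR front-end -> text-only LLM back-end.
# Model names and the audio path are illustrative placeholders.
from transformers import pipeline

# Stage 1: speech recognition (Whisper used here purely as an example ASR).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Stage 2: text-only LLM that never sees the audio, only the transcript.
llm = pipeline("text-generation", model="gpt2")

def cascaded_agent(audio_path: str) -> str:
    """Transcribe the user's utterance, then condition a text LLM on it."""
    transcript = asr(audio_path)["text"]
    prompt = f"User said: {transcript}\nAssistant:"
    reply = llm(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    # The pipeline returns prompt + continuation; keep only the continuation.
    return reply[len(prompt):].strip()

print(cascaded_agent("turn_on_the_lights.wav"))  # hypothetical audio file
```

The E2E and hybrid families replace this two-stage hand-off with a single model (or a retrieval-augmented one) that consumes audio tokens directly, which is what the cross‑modal alignment techniques below enable.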

Cross‑Modal Alignment

  • Audio Tokenizers (e.g., Encodec, VQ‑Wav2Vec) convert the raw waveform into discrete tokens compatible with LLM vocabularies; a tokenization sketch follows this list.
  • Joint Pre‑Training on paired speech‑text corpora (e.g., VoxPopuli) using a multi‑task loss that mixes masked language modeling, speech‑text contrastive learning, and next‑utterance prediction.
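
As an illustration of the audio‑tokenizer idea, the sketch below discretizes a waveform with EnCodec (via transformers) and offsets the resulting code indices past an assumed text vocabulary size so that text and audio tokens can share one index space. The offset scheme, vocabulary size, and codebook selection are illustrative assumptions, not the paper's recipe.

```python
# Sketch: turn raw audio into discrete tokens an LLM can consume.
# The vocabulary size and offset scheme are illustrative assumptions.
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

TEXT_VOCAB_SIZE = 50_000  # assumed size of the base text-LLM vocabulary

def audio_to_llm_tokens(waveform, sampling_rate=24_000):
    """Encode a mono waveform into discrete codes, then shift them past the
    text vocabulary so they act as 'audio tokens' for an extended LLM."""
    inputs = processor(raw_audio=waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        encoded = model.encode(inputs["input_values"], inputs.get("padding_mask"))
    codes = encoded.audio_codes            # layout varies across versions
    first_codebook = codes.reshape(-1, codes.shape[-1])[0]  # codes from codebook 0
    return (first_codebook + TEXT_VOCAB_SIZE).tolist()

# Example: one second of silence stands in for a real utterance.
tokens = audio_to_llm_tokens(torch.zeros(24_000).numpy())
print(len(tokens), tokens[:8])
```

In joint pre‑training, sequences of such audio tokens are interleaved with text tokens so the multi‑task losses (masked language modeling, speech‑text contrastive learning, next‑utterance prediction) can operate over a single shared vocabulary.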

Evaluation Framework

  • Core metrics: Word Error Rate (WER) for transcription, Semantic Error Rate (SER) for intent, and LLM‑specific scores (BLEU, ROUGE, safety violation rate); a small metrics sketch follows this list.
  • Robustness tests: Simulated channel noise, speaker accent variation, and code‑switching scenarios.
  • Latency & Memory profiling for streaming vs. batch inference on CPUs, GPUs, and edge ASICs.
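
The core metrics are easy to reproduce locally. The sketch below computes corpus‑level WER with the jiwer package and uses a simple intent‑mismatch rate as a stand‑in for SER; the paper's exact SER definition may differ, and the reference/hypothesis strings are toy examples.

```python
# Sketch: corpus-level WER plus a naive intent-error rate.
# The SER proxy (exact intent-label mismatch) is an illustrative simplification.
import jiwer

refs = ["turn on the kitchen lights", "what is the weather in berlin"]
hyps = ["turn on the kitchen light",  "what is the weather in berlin"]

ref_intents = ["smart_home.lights_on", "weather.query"]
hyp_intents = ["smart_home.lights_on", "weather.query"]

wer = jiwer.wer(refs, hyps)  # total edits / total reference words
ser = sum(r != h for r, h in zip(ref_intents, hyp_intents)) / len(ref_intents)

print(f"WER: {wer:.3f}")  # ~0.091 here: one substitution over 11 reference words
print(f"SER: {ser:.3f}")
```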

Experimental Setup

  • Baselines reproduced on public cloud GPUs (A100) and on‑device NPU (Qualcomm Hexagon).
  • Open‑source pipelines released under Apache‑2.0, enabling reproducibility across research labs and product teams.

Results & Findings

| Architecture | Avg. WER ↓ | Intent SER ↓ | Latency (ms) | Safety Violations (per 1k turns) |
|---|---|---|---|---|
| Cascaded (ASR + GPT‑3) | 7.8% | 12.4% | 210 | 8 |
| E2E Speech‑LLM (Whisper‑based) | 6.5% | 10.1% | 140 | 5 |
| Hybrid Retrieval‑Vision | 5.9% | 9.3% | 180 | 4 |
  • E2E models consistently beat cascaded pipelines on both transcription accuracy and intent recognition, while cutting inference latency by ~30 %.
  • Hybrid systems excel in open‑domain knowledge grounding, achieving the lowest safety violation rate thanks to retrieval‑based fact‑checking before generation.
  • Robustness tests show a 2–3× degradation for cascaded setups under heavy accent variation, whereas E2E models retain >80 % of baseline performance.
  • Streaming inference (frame‑wise decoding) adds <30 ms of overhead, making real‑time voice assistants feasible on modern edge hardware; a simple timing harness is sketched below.
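
To check the streaming‑overhead claim on your own hardware, a generic timing harness along these lines compares batch and frame‑wise decoding. The model_step callable is a hypothetical stand‑in for whatever ASR or speech‑LLM forward pass you deploy; it is not the paper's profiling code.

```python
# Generic latency harness: batch vs. frame-wise (chunked) inference.
# `model_step` is a hypothetical stand-in for a real ASR / speech-LLM forward pass.
import time
import numpy as np

def model_step(audio: np.ndarray) -> None:
    """Placeholder workload; replace with a real forward pass."""
    np.fft.rfft(audio)  # arbitrary compute so the timings are nonzero

def time_batch(audio: np.ndarray) -> float:
    start = time.perf_counter()
    model_step(audio)
    return (time.perf_counter() - start) * 1e3  # ms

def time_streaming(audio: np.ndarray, frame_ms: int = 80, sr: int = 16_000) -> float:
    """Feed fixed-size frames one at a time and report total compute time."""
    frame = int(sr * frame_ms / 1000)
    start = time.perf_counter()
    for i in range(0, len(audio), frame):
        model_step(audio[i:i + frame])
    return (time.perf_counter() - start) * 1e3  # ms

audio = np.random.randn(16_000 * 5).astype(np.float32)  # 5 s of synthetic audio
print(f"batch:     {time_batch(audio):7.2f} ms")
print(f"streaming: {time_streaming(audio):7.2f} ms")
```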

Practical Implications

  • Faster time‑to‑market: Developers can replace a multi‑component ASR + NLU stack with a single E2E speech‑LLM, reducing engineering overhead and integration bugs.
  • Edge deployment: The paper’s streaming recipes demonstrate sub‑200 ms latency on on‑device NPUs, opening doors for privacy‑first assistants that never send raw audio to the cloud.
  • Multimodal extensions: By plugging a vision retriever into the pipeline, products can answer visual questions (“What’s on my screen?”) while staying voice‑first.
  • Safety by design: Retrieval‑augmented generation offers a practical guardrail (fact‑checking before the model responds), which is valuable in compliance‑heavy sectors such as finance and healthcare; a minimal grounding sketch follows this list.
  • Accented user support: E2E models trained on diverse speech corpora provide more equitable experiences for global user bases, reducing the “accent bias” gap.
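
The "retrieve, then generate" guardrail can be prototyped in a few lines: ground the prompt in retrieved passages and have the agent abstain when evidence is missing. The retrieve and generate callables below are hypothetical placeholders, not a specific retriever or speech‑LLM API.

```python
# Sketch of a retrieve-then-generate guardrail for a spoken agent.
# `retrieve` and `generate` are hypothetical placeholders, not a specific API.
from typing import Callable, List

def grounded_reply(transcript: str,
                   retrieve: Callable[[str], List[str]],
                   generate: Callable[[str], str],
                   min_evidence: int = 1) -> str:
    """Answer only when retrieval returns supporting passages; otherwise abstain."""
    passages = retrieve(transcript)
    if len(passages) < min_evidence:
        return "I couldn't verify that, so I'd rather not guess."
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the user using ONLY the evidence below. "
        "If the evidence is insufficient, say so.\n"
        f"Evidence:\n{context}\n\nUser: {transcript}\nAssistant:"
    )
    return generate(prompt)

# Toy usage with stub functions standing in for a real retriever and LLM.
reply = grounded_reply(
    "what is my account's overdraft limit?",
    retrieve=lambda q: ["Overdraft limit for standard accounts: $500."],
    generate=lambda p: "Your standard account's overdraft limit is $500.",
)
print(reply)
```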

Limitations & Future Work

  • Data hunger: Joint speech‑text‑LLM training still requires massive paired corpora; low‑resource languages remain under‑served.
  • Compute cost: Training end‑to‑end speech‑LLMs at the scale of GPT‑3 is expensive, limiting accessibility for smaller teams.
  • Evaluation gaps: Current metrics (WER, SER) don’t fully capture conversational coherence or user satisfaction; the authors call for richer dialogue‑level benchmarks.
  • Privacy‑safety trade‑offs: While on‑device inference improves privacy, it constrains model size, potentially affecting safety guardrails that rely on large external knowledge bases.

The authors outline a roadmap that includes:

  1. Lightweight distillation techniques for on‑device speech‑LLMs.
  2. Self‑supervised cross‑modal pre‑training for under‑represented languages.
  3. Standardized, user‑centric evaluation suites for spoken dialogue systems.

Bottom Line

This tutorial‑style paper equips developers with a clear, reproducible path from legacy cascaded speech pipelines to modern, voice‑native LLM assistants—complete with performance numbers, code, and a candid look at the hurdles that still need to be cleared.

Authors

  • Chao-Han Huck Yang
  • Andreas Stolcke
  • Larry Heck

Paper Information

  • arXiv ID: 2512.02593v1
  • Categories: cs.CL, cs.MA, cs.NE, cs.SD, eess.AS
  • Published: December 2, 2025