[Paper] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?

Published: February 19, 2026 at 01:22 PM EST
4 min read
Source: arXiv (2602.17598v1)

Overview

The paper investigates whether modern speech‑enabled large language models (LLMs) are truly “end‑to‑end” or if they are just fancy versions of the classic ASR → LLM pipeline (e.g., Whisper transcription followed by a text‑only LLM). By carefully matching the language‑model backbone across speech‑LLM and cascade setups, the authors show that for three out of four examined systems the behavior is statistically indistinguishable from a simple cascade, while one model (Qwen2‑Audio) breaks the pattern.

Key Contributions

  • Matched‑backbone evaluation: First systematic comparison that holds the LLM component constant while swapping the speech front‑end (speech‑LLM vs. Whisper → LLM cascade).
  • Empirical equivalence evidence:
    • Ultravox’s outputs achieve a Cohen’s κ of 0.93 with its Whisper → LLM counterpart.
    • Logit‑lens probing uncovers literal text tokens surfacing in hidden layers of the speech‑LLM.
    • LEACE concept‑erasure experiments demonstrate that removing the emergent text representation collapses task accuracy to near‑zero.
  • Architecture‑dependence: Qwen2‑Audio diverges from cascade behavior, showing that equivalence is not universal across speech‑LLM designs.
  • Noise robustness analysis: Under noisy conditions (down to 0 dB SNR) the speech‑LLM’s advantage disappears and can reverse, with the cascade winning by up to 7.6 %, making cascades more reliable in real‑world audio.
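The Cohen’s κ agreement statistic cited above is straightforward to compute by hand. The sketch below uses two short, made‑up answer sequences (not from the paper) to show how chance‑corrected agreement between a speech‑LLM and its cascade counterpart would be measured:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of positions where both raters agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's label frequencies
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical outputs from a speech-LLM and its Whisper -> LLM cascade
speech_llm = ["yes", "no", "yes", "yes", "no", "yes"]
cascade    = ["yes", "no", "yes", "no",  "no", "yes"]
print(round(cohens_kappa(speech_llm, cascade), 2))  # → 0.67
```

A κ of 0.93, as reported for Ultravox, means the two systems disagree barely more often than a single system disagrees with itself across sampling runs.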

Methodology

  1. Model selection – Four publicly available speech‑LLMs (including Ultravox and Qwen2‑Audio) were paired with Whisper as the ASR front‑end. The same text‑only LLM backbone (e.g., Llama‑2, Mistral) was used for both the speech‑LLM and the cascade, ensuring a fair “apples‑to‑apples” comparison.
  2. Task suite – Six downstream tasks that can be solved purely from a transcript (e.g., question answering, summarization, sentiment analysis).
  3. Metrics – Agreement measured with Cohen’s κ, task‑specific accuracy/F1, and probing tools:
    • Logit lens: visualizes token probabilities inside hidden states to see if text tokens emerge.
    • LEACE (LEAst‑squares Concept Erasure): removes the discovered text concept from the model’s hidden representations and measures the resulting performance drop.
  4. Noise experiments – Audio inputs were corrupted with additive white noise at various signal‑to‑noise ratios (SNRs) to test robustness.

Results & Findings

| Model | Cascade equivalence (κ) | Text emergence (logit lens) | LEACE impact | Noise‑induced Δ (max) |
|---|---|---|---|---|
| Ultravox | 0.93 (statistically indistinguishable) | Clear text‑token peaks in middle layers | Accuracy → ~0 % after erasure | −7.6 % at 0 dB (cascade wins) |
| Other 2 speech‑LLMs | > 0.85, similar pattern | Text tokens visible | Same collapse effect | Similar degradation |
| Qwen2‑Audio | κ ≈ 0.45 (significant divergence) | Weak/absent text signatures | Minimal effect | More resilient to noise |

Takeaway: For most current speech‑LLMs, the “speech‑to‑text” step is still the dominant computation; the model essentially transcribes internally before feeding the text to its language core. Only Qwen2‑Audio shows a genuine end‑to‑end behavior, hinting that architectural tweaks (e.g., multimodal encoders, joint training) can break the cascade equivalence.
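The LEACE erasure behind the “accuracy collapses to near‑zero” result can be illustrated with a much cruder simplification: estimate a concept direction from class means and project it out of every hidden state. (This is our mean‑difference sketch, not the paper’s closed‑form LEACE, and all data below is synthetic.)

```python
import numpy as np

def erase_direction(X, d):
    """Remove the component of each row of X along direction d
    (a crude stand-in for LEACE's closed-form affine erasure)."""
    d = d / np.linalg.norm(d)
    return X - np.outer(X @ d, d)

rng = np.random.default_rng(0)
# Synthetic hidden states for two concept classes, separated along axis 0
pos = rng.normal(0.0, 0.1, (50, 8)); pos[:, 0] += 2.0
neg = rng.normal(0.0, 0.1, (50, 8)); neg[:, 0] -= 2.0
X = np.vstack([pos, neg])

concept_dir = pos.mean(0) - neg.mean(0)   # concept direction from class means
X_erased = erase_direction(X, concept_dir)

# Before erasure the classes are linearly separable; afterwards the
# mean gap along the concept axis vanishes.
gap_before = pos[:, 0].mean() - neg[:, 0].mean()
gap_after = X_erased[:50, 0].mean() - X_erased[50:, 0].mean()
print(f"gap before: {gap_before:.2f}, gap after: {gap_after:.2e}")
```

If downstream task accuracy collapses once such a direction is removed, as reported here, the model was relying on that (textual) representation rather than on richer acoustic features.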

Practical Implications

  • Cost & latency: Deploying a speech‑LLM that behaves like a cascade offers no performance gain but incurs higher GPU memory and inference time compared to a separate Whisper + LLM stack. Teams can stick with the cheaper, well‑optimized cascade for most applications (voice assistants, transcription‑augmented chatbots).
  • Debugging & interpretability: Knowing that text representations are explicit inside the model means developers can apply existing ASR debugging tools (e.g., alignment visualizers) to speech‑LLMs, simplifying error analysis.
  • Noise handling: Since cascades outperform speech‑LLMs under severe noise, production pipelines that must operate in noisy environments (call‑center analytics, in‑car assistants) should retain a dedicated ASR front‑end with proven noise‑robustness.
  • Model selection: If a truly end‑to‑end advantage (e.g., leveraging prosody or speaker cues) is required, Qwen2‑Audio or future architectures that break the equivalence should be preferred.
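The 0 dB SNR corruption behind the noise findings can be reproduced with a small additive‑noise helper (a sketch; the function name and parameters are ours, not the paper’s):

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Corrupt a waveform with additive white Gaussian noise
    scaled to hit a target signal-to-noise ratio in dB."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(len(signal))
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale so that 10*log10(p_sig / p_scaled_noise) == snr_db
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Hypothetical 1-second, 16 kHz tone corrupted at 0 dB (noise power == signal power)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(clean, snr_db=0.0, rng=np.random.default_rng(0))
```

At 0 dB the noise carries as much power as the speech itself, which is the regime where the paper finds cascades overtaking speech‑LLMs.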

Limitations & Future Work

  • Task scope: The study only covers tasks solvable from a transcript; it does not address scenarios where acoustic cues (tone, emphasis) matter (e.g., emotion detection, speaker intent).
  • Model diversity: Only four speech‑LLMs were examined; newer or proprietary systems might behave differently.
  • Noise types: Experiments used synthetic white noise; real‑world distortions (reverberation, background speech) could yield different patterns.
  • Future directions:
    • Extend probing to multimodal concepts (prosody, speaker identity).
    • Explore training regimes that explicitly discourage implicit transcription, encouraging richer acoustic utilization.
    • Benchmark a broader set of noise conditions and real‑world datasets to validate robustness claims.

Authors

  • Jayadev Billa

Paper Information

  • arXiv ID: 2602.17598v1
  • Categories: cs.CL, cs.AI, eess.AS
  • Published: February 19, 2026
