[Paper] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?

Published: February 19, 2026 at 01:22 PM EST
4 min read
Source: arXiv (2602.17598v1)

Overview

The paper investigates whether modern speech‑enabled large language models (LLMs) are truly “end‑to‑end” or if they are just fancy versions of the classic ASR → LLM pipeline (e.g., Whisper transcription followed by a text‑only LLM). By carefully matching the language‑model backbone across speech‑LLM and cascade setups, the authors show that for three out of four examined systems the behavior is statistically indistinguishable from a simple cascade, while one model (Qwen2‑Audio) breaks the pattern.

Key Contributions

  • Matched‑backbone evaluation: First systematic comparison that holds the LLM component constant while swapping the speech front‑end (speech‑LLM vs. Whisper → LLM cascade).
  • Empirical equivalence evidence:
    • Ultravox’s outputs achieve a Cohen’s κ of 0.93 with its Whisper → LLM counterpart.
    • Logit‑lens probing uncovers literal text tokens surfacing in hidden layers of the speech‑LLM.
    • LEACE concept‑erasure experiments demonstrate that removing the emergent text representation collapses task accuracy to near‑zero.
  • Architecture‑dependence: Qwen2‑Audio diverges from cascade behavior, showing that equivalence is not universal across speech‑LLM designs.
  • Noise robustness analysis: Under noisy conditions (down to 0 dB SNR) the speech‑LLM’s advantage disappears and can reverse, with the cascade winning by up to 7.6 %, making cascades more reliable in real‑world audio.
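The Cohen’s κ agreement statistic cited above is straightforward to compute by hand. The sketch below uses two short, made‑up answer sequences (not from the paper) to show how chance‑corrected agreement between a speech‑LLM and its cascade counterpart would be measured:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of positions where both raters agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's label frequencies
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical outputs from a speech-LLM and its Whisper -> LLM cascade
speech_llm = ["yes", "no", "yes", "yes", "no", "yes"]
cascade    = ["yes", "no", "yes", "no",  "no", "yes"]
print(round(cohens_kappa(speech_llm, cascade), 2))  # → 0.67
```

A κ of 0.93, as reported for Ultravox, means the two systems disagree barely more often than a single system disagrees with itself across sampling runs.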

Methodology

  1. Model selection – Four publicly available speech‑LLMs (including Ultravox and Qwen2‑Audio) were paired with Whisper as the ASR front‑end. The same text‑only LLM backbone (e.g., Llama‑2, Mistral) was used for both the speech‑LLM and the cascade, ensuring a fair “apples‑to‑apples” comparison.
  2. Task suite – Six downstream tasks that can be solved purely from a transcript (e.g., question answering, summarization, sentiment analysis).
  3. Metrics – Agreement measured with Cohen’s κ, task‑specific accuracy/F1, and probing tools:
    • Logit lens: visualizes token probabilities inside hidden states to see if text tokens emerge.
    • LEACE (LEAst‑squares Concept Erasure): removes the discovered text concept from the model’s hidden representations and measures the resulting performance drop.
  4. Noise experiments – Audio inputs were corrupted with additive white noise at various signal‑to‑noise ratios (SNRs) to test robustness.

Results & Findings

| Model | Cascade equivalence (κ) | Text emergence (logit lens) | LEACE impact | Noise‑induced Δ (max) |
|---|---|---|---|---|
| Ultravox | 0.93 (statistically indistinguishable) | Clear text‑token peaks in middle layers | Accuracy → ~0 % after erasure | −7.6 % at 0 dB (cascade wins) |
| Other 2 speech‑LLMs | > 0.85, similar pattern | Text tokens visible | Same collapse effect | Similar degradation |
| Qwen2‑Audio | κ ≈ 0.45 (significant divergence) | Weak/absent text signatures | Minimal effect | More resilient to noise |

Takeaway: For most current speech‑LLMs, the “speech‑to‑text” step is still the dominant computation; the model essentially transcribes internally before feeding the text to its language core. Only Qwen2‑Audio shows a genuine end‑to‑end behavior, hinting that architectural tweaks (e.g., multimodal encoders, joint training) can break the cascade equivalence.
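The LEACE erasure behind the “accuracy collapses to near‑zero” result can be illustrated with a much cruder simplification: estimate a concept direction from class means and project it out of every hidden state. (This is our mean‑difference sketch, not the paper’s closed‑form LEACE, and all data below is synthetic.)

```python
import numpy as np

def erase_direction(X, d):
    """Remove the component of each row of X along direction d
    (a crude stand-in for LEACE's closed-form affine erasure)."""
    d = d / np.linalg.norm(d)
    return X - np.outer(X @ d, d)

rng = np.random.default_rng(0)
# Synthetic hidden states for two concept classes, separated along axis 0
pos = rng.normal(0.0, 0.1, (50, 8)); pos[:, 0] += 2.0
neg = rng.normal(0.0, 0.1, (50, 8)); neg[:, 0] -= 2.0
X = np.vstack([pos, neg])

concept_dir = pos.mean(0) - neg.mean(0)   # concept direction from class means
X_erased = erase_direction(X, concept_dir)

# Before erasure the classes are linearly separable; afterwards the
# mean gap along the concept axis vanishes.
gap_before = pos[:, 0].mean() - neg[:, 0].mean()
gap_after = X_erased[:50, 0].mean() - X_erased[50:, 0].mean()
print(f"gap before: {gap_before:.2f}, gap after: {gap_after:.2e}")
```

If downstream task accuracy collapses once such a direction is removed, as reported here, the model was relying on that (textual) representation rather than on richer acoustic features.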

Practical Implications

  • Cost & latency: Deploying a speech‑LLM that behaves like a cascade offers no performance gain but incurs higher GPU memory and inference time compared to a separate Whisper + LLM stack. Teams can stick with the cheaper, well‑optimized cascade for most applications (voice assistants, transcription‑augmented chatbots).
  • Debugging & interpretability: Knowing that text representations are explicit inside the model means developers can apply existing ASR debugging tools (e.g., alignment visualizers) to speech‑LLMs, simplifying error analysis.
  • Noise handling: Since cascades outperform speech‑LLMs under severe noise, production pipelines that must operate in noisy environments (call‑center analytics, in‑car assistants) should retain a dedicated ASR front‑end with proven noise‑robustness.
  • Model selection: If a truly end‑to‑end advantage (e.g., leveraging prosody or speaker cues) is required, Qwen2‑Audio or future architectures that break the equivalence should be preferred.
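The 0 dB SNR corruption behind the noise findings can be reproduced with a small additive‑noise helper (a sketch; the function name and parameters are ours, not the paper’s):

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Corrupt a waveform with additive white Gaussian noise
    scaled to hit a target signal-to-noise ratio in dB."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(len(signal))
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale so that 10*log10(p_sig / p_scaled_noise) == snr_db
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Hypothetical 1-second, 16 kHz tone corrupted at 0 dB (noise power == signal power)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(clean, snr_db=0.0, rng=np.random.default_rng(0))
```

At 0 dB the noise carries as much power as the speech itself, which is the regime where the paper finds cascades overtaking speech‑LLMs.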

Limitations & Future Work

  • Task scope: The study only covers tasks solvable from a transcript; it does not address scenarios where acoustic cues (tone, emphasis) matter (e.g., emotion detection, speaker intent).
  • Model diversity: Only four speech‑LLMs were examined; newer or proprietary systems might behave differently.
  • Noise types: Experiments used synthetic white noise; real‑world distortions (reverberation, background speech) could yield different patterns.
  • Future directions:
    • Extend probing to multimodal concepts (prosody, speaker identity).
    • Explore training regimes that explicitly discourage implicit transcription, encouraging richer acoustic utilization.
    • Benchmark a broader set of noise conditions and real‑world datasets to validate robustness claims.

Authors

  • Jayadev Billa

Paper Information

  • arXiv ID: 2602.17598v1
  • Categories: cs.CL, cs.AI, eess.AS
  • Published: February 19, 2026
