[Paper] Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models
Source: arXiv - 2606.05161v1
Overview
Audio‑language models (ALMs) are designed to answer questions by jointly reasoning over spoken audio and accompanying text. However, when the text and audio give contradictory cues, many ALMs follow the text, even when the audio provides clear evidence to the contrary. This paper investigates whether the audio information is truly missing from the model’s internal representation or is merely overruled during the final decision‑making step, and proposes a lightweight fix that lets developers recover the audio‑driven answer without retraining the model.
Key Contributions
- Counterfactual arbitration analysis: Introduces a same‑audio counterfactual that removes only the conflicting text, revealing that 64 % of conflict cases flip their answer preference, indicating that audio evidence is present but suppressed.
- Fine‑grained localization: Uses activation‑patching to pinpoint the reversal to the answer‑position computation stage, and shows a near‑perfect correlation (Spearman ρ = 0.93) between patching impact and candidate‑score differences.
- Gated Audio Counterfactual Logit Correction (GACL): A training‑free decoding rule that interpolates between joint (audio + text) and same‑audio scores, effectively “un‑arbitrating” the decision.
- Strong empirical gains: Under a strict 5 percentage‑point faithfulness‑drop budget, GACL lifts normalized area‑under‑curve (nAUC) by 17.8 pp over the strongest contrastive baseline, and the method transfers to vision‑text arbitration tasks with up to +40.5 pp improvement.
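The counterfactual analysis in the first contribution can be illustrated with a minimal sketch. It assumes each example provides candidate logits from the joint run and from the same‑audio run, plus the indices of the audio‑supported and text‑supported answers. The function names (`margin`, `is_reversal`, `reversal_rate`) and the toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np

def margin(logits, audio_idx, text_idx):
    """Score margin; positive when the audio-supported candidate is preferred."""
    return logits[audio_idx] - logits[text_idx]

def is_reversal(joint_logits, same_audio_logits, audio_idx, text_idx):
    """True when the joint run prefers the text-supported answer but the
    same-audio counterfactual prefers the audio-supported answer."""
    joint_m = margin(joint_logits, audio_idx, text_idx)
    cf_m = margin(same_audio_logits, audio_idx, text_idx)
    return bool(joint_m < 0 and cf_m > 0)

def reversal_rate(examples):
    """Fraction of conflict examples whose preference flips sign."""
    flips = [is_reversal(*ex) for ex in examples]
    return sum(flips) / len(flips)

# Toy data (made up): (joint logits, same-audio logits, audio_idx, text_idx)
examples = [
    (np.array([1.0, 3.0]), np.array([2.5, 0.5]), 0, 1),  # flip
    (np.array([1.0, 3.0]), np.array([0.5, 2.0]), 0, 1),  # text preferred in both runs
    (np.array([4.0, 1.0]), np.array([4.5, 0.2]), 0, 1),  # audio preferred in both runs
    (np.array([0.0, 2.0]), np.array([1.0, 0.0]), 0, 1),  # flip
]
rate = reversal_rate(examples)
print(f"reversal rate: {rate:.2f}")
```

A high flip rate on real conflict data would suggest the audio evidence is encoded but overridden, which is the pattern the paper reports.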
Methodology
- Conflict benchmark creation – The authors build four tasks where the textual prompt deliberately contradicts the audio (e.g., “What color is the car?” with a spoken description saying “red” while the text says “blue”).
- Same‑audio counterfactual – For each example they keep the audio unchanged but replace the conflicting text with a neutral placeholder, then compare the model’s answer distribution to the original joint (audio + text) run.
- Arbitration reversal detection – A sign flip occurs when the joint model prefers the text‑supported answer but the same‑audio run prefers the audio‑supported answer. The proportion of flips quantifies how often audio evidence is present but overridden.
- Activation patching – By swapping intermediate activations from the same‑audio run into the joint run, they isolate which layer(s) cause the reversal. The effect size of each patch is measured against the change in output logits.
- GACL decoding rule – At inference time, the model’s final logits are recomputed as a weighted blend:
\[ \text{logit}_{\text{GACL}} = \alpha \cdot \text{logit}_{\text{joint}} + (1-\alpha) \cdot \text{logit}_{\text{same-audio}} \]
The gate α is chosen per‑example based on a simple heuristic that respects a pre‑specified “faithfulness‑drop” budget (i.e., how much overall accuracy we’re willing to sacrifice to gain audio fidelity). No model parameters are altered.
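The blend itself is a one‑line interpolation, sketched below under stated assumptions. The summary does not specify the gating heuristic, so `toy_gate` is a hypothetical stand‑in: it intervenes only when the joint and same‑audio runs disagree on the top candidate, and it does not model the faithfulness‑drop budget. The value `alpha_conflict=0.3` is arbitrary.

```python
import numpy as np

def gacl_logits(joint, same_audio, alpha):
    """Interpolate joint and same-audio logits: alpha*joint + (1-alpha)*same_audio."""
    return alpha * joint + (1.0 - alpha) * same_audio

def toy_gate(joint, same_audio, alpha_conflict=0.3):
    """Hypothetical gate (not the paper's heuristic): shift weight toward the
    same-audio scores only when the two runs disagree on the top candidate."""
    if int(np.argmax(joint)) != int(np.argmax(same_audio)):
        return alpha_conflict
    return 1.0  # no disagreement: keep the joint logits unchanged

# Toy logits over two candidates: index 0 = audio-supported, 1 = text-supported
joint = np.array([1.0, 3.0])       # joint run follows the text
same_audio = np.array([2.5, 0.5])  # same-audio run follows the audio

alpha = toy_gate(joint, same_audio)
corrected = gacl_logits(joint, same_audio, alpha)
print(alpha, corrected, int(np.argmax(corrected)))
```

With α = 1 the rule reduces to ordinary joint decoding, and with α = 0 it uses only the same‑audio scores. No model weights change in either case.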
Results & Findings
- Arbitration reversal prevalence: 64.1 % of conflict samples across five ALMs (including Whisper‑based and CLIP‑audio variants) flip their answer when the text is neutralized.
- Localization: Patching the answer‑position computation layer alone reproduces >90 % of the reversal effect, confirming that the arbitration happens late in the forward pass.
- GACL performance:
  - Audio‑faithfulness: nAUC improves by 17.8 pp compared to the best contrastive baseline while staying within a 5 pp drop in overall task accuracy.
  - Cross‑modal transfer: When applied to vision‑text models (e.g., CLIP), GACL yields up to +40.5 pp nAUC gain, demonstrating that the phenomenon is not limited to audio.
- Efficiency: Because GACL is a decoding‑time operation, it adds negligible overhead (< 2 ms per query) and requires no extra training data.
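To make the localization result concrete, here is a toy illustration of activation patching in general. It is not the authors' setup: the two‑layer network, its random weights, and the input vectors are all invented for illustration. A real ALM would need hooks on transformer layers at the answer position.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # input -> hidden weights (toy)
W2 = rng.normal(size=(2, 4))  # hidden -> candidate logits (toy)

def hidden(x):
    return np.tanh(W1 @ x)

def readout(h):
    return W2 @ h

def margin(logits):
    """Candidate-score difference between answer 0 and answer 1."""
    return logits[0] - logits[1]

# Hypothetical inputs: the first two features stand in for audio,
# the last for the conflicting text (set to 0.0 in the counterfactual).
x_joint = np.array([1.0, 0.5, -1.0])
x_cf = np.array([1.0, 0.5, 0.0])

h_joint, h_cf = hidden(x_joint), hidden(x_cf)
base = margin(readout(h_joint))   # margin of the unpatched joint run
target = margin(readout(h_cf))    # margin of the same-audio run

def patch_effect(units):
    """Copy the chosen hidden units from the counterfactual run into the
    joint run and return the resulting change in margin."""
    h = h_joint.copy()
    h[units] = h_cf[units]
    return margin(readout(h)) - base

full = patch_effect([0, 1, 2, 3])
per_unit = [patch_effect([i]) for i in range(4)]
print(full, target - base, per_unit)
```

In this toy, patching every hidden unit recovers the whole margin shift. Because the readout is linear, the per‑unit effects also sum to the full effect. In a real model, comparing each patch's effect with the change in candidate scores is what allows the reversal to be attributed to a particular stage.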
Practical Implications
- More reliable multimodal assistants: Voice‑first AI assistants (e.g., smart speakers) can now trust that the spoken content will be respected even when the UI text or prior context is misleading.
- Debugging and auditing: Activation‑patching offers a diagnostic toolkit for developers to locate arbitration bugs in any multimodal model, facilitating transparent model introspection.
- Zero‑cost improvement pipelines: Teams can integrate GACL as a plug‑in to existing ALM deployments (e.g., Whisper‑based transcription + QA pipelines) without retraining, gaining a measurable boost in audio‑driven correctness.
- Cross‑modal robustness: The same technique can be ported to vision‑language or video‑language systems, helping products that need to reconcile captions, subtitles, or on‑screen text with visual cues.
Limitations & Future Work
- Faithfulness budget trade‑off: GACL deliberately sacrifices a small amount of overall accuracy to improve audio fidelity; the optimal balance may differ across applications and requires manual tuning.
- Scope of conflicts: The study focuses on deliberately crafted contradictions; real‑world ambiguities (e.g., noisy audio, subtle textual nuance) may behave differently.
- Model‑specific gating: The current heuristic for α is simple; learning a more sophisticated, possibly context‑aware gating function could further close the faithfulness gap.
- Broader modality extensions: While early experiments on vision‑text are promising, systematic evaluation on video‑audio‑text triads and on large‑scale production datasets remains open.
Bottom line: The paper uncovers a hidden “arbitration bias” in audio‑language models and offers a practical, training‑free remedy that can be dropped into existing pipelines, paving the way for multimodal systems that truly listen when the text lies.
Authors
- Yichen Gao
- Yiqun Zhang
- Zijing Wang
- Yujia Li
- Heng Guo
- Xi Wu
- Xiaocui Yang
- Shi Feng
- Yifei Zhang
- Daling Wang
Paper Information
- arXiv ID: 2606.05161v1
- Categories: cs.SD, cs.CL
- Published: June 3, 2026