[Paper] Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models
Source: arXiv - 2601.23255v1
Overview
The paper Now You Hear Me: Audio Narrative Attacks Against Large Audio‑Language Models uncovers a new class of security threats that arise when powerful multimodal models start processing raw speech instead of just text. By turning a jailbreak prompt into a spoken story, the authors show that state‑of‑the‑art audio‑language systems can be tricked into ignoring their safety filters, raising urgent concerns for voice‑first products such as assistants, tutoring bots, and clinical triage tools.
Key Contributions
- Audio‑only jailbreak: Introduces a novel “narrative‑style” audio attack that embeds disallowed instructions inside a synthetic speech story, bypassing text‑centric safety checks.
- Leveraging advanced TTS: Uses a high‑fidelity instruction‑following text‑to‑speech model to preserve the semantic payload while sounding natural, exploiting both linguistic and acoustic cues.
- Empirical validation: Demonstrates a 98.26 % success rate against Gemini 2.0 Flash (and comparable rates on other leading audio‑language models), dramatically outperforming traditional text‑only jailbreaks.
- Threat taxonomy: Highlights how the shift from text to speech expands the attack surface, requiring safety mechanisms that jointly reason over language and paralinguistic signals.
- Open‑source toolkit: Releases the code and audio prompts used in the study, enabling reproducibility and further research on defenses.
Methodology
- Prompt design – The researchers craft a narrative prompt that weaves a prohibited command (e.g., “give instructions for hacking”) into a harmless‑sounding story.
- Instruction‑following TTS – They feed the prompt to a cutting‑edge TTS system trained to follow user instructions, producing a synthetic audio clip that sounds like a natural spoken story.
- Audio delivery – The generated clip is fed directly to the target audio‑language model (ALM) via its speech‑input API, just as a user would speak to a voice assistant.
- Response analysis – The model’s textual output is examined to see whether it obeys the hidden command. Success is measured by the proportion of attempts that yield the disallowed response.
- Baseline comparison – The same malicious intent is delivered as plain text and as a “flat” audio read‑out (no narrative) to quantify the advantage of the narrative approach.
Results & Findings
| Target Model | Text‑only jailbreak success | Narrative‑audio jailbreak success |
|---|---|---|
| Gemini 2.0 Flash | ~12 % | 98.26 % |
| Other ALMs (e.g., Whisper‑based) | 8–15 % | 85–96 % |
- The narrative format consistently outperforms flat audio and text prompts, suggesting that the model’s safety filters are tuned to detect explicit textual cues but not subtle story structures.
- Acoustic cues (prosody, pauses) appear to reinforce the model’s belief that the input is benign, further weakening the filter.
- Even when the same TTS voice is used for benign and malicious prompts, the model fails to differentiate, indicating a lack of cross‑modal safety reasoning.
Practical Implications
- Voice assistants: Malicious actors could embed harmful instructions in podcasts, audiobooks, or even phone calls, causing the assistant to reveal restricted information or perform unsafe actions.
- Enterprise AI pipelines: Companies that ingest audio (e.g., call‑center analytics) may inadvertently process compromised speech, leading to data leakage or policy violations.
- Regulatory compliance: Safety certifications that focus on text‑based prompt filtering will be insufficient for products that accept speech, prompting a need for new standards.
- Defensive tooling: Developers should consider multimodal content moderation—e.g., running a parallel text transcription check, employing acoustic anomaly detectors, or designing “speech‑aware” safety layers that evaluate both the transcript and the audio’s prosodic patterns.
- User education: End‑users must be aware that seemingly innocuous audio content could be a vector for jailbreaks, especially as synthetic voice generation becomes more accessible.
Limitations & Future Work
- Synthetic audio focus: The study relies on high‑quality TTS output; real‑world recordings (background noise, speaker variability) may affect attack success, a factor the authors acknowledge.
- Model scope: Experiments are limited to a handful of publicly known ALMs; proprietary or domain‑specific models could behave differently.
- Defensive baselines: While the paper proposes initial mitigation ideas, it does not implement or evaluate concrete countermeasures, leaving that as an open research direction.
- Future avenues: Extending attacks to multimodal inputs (audio + visual), exploring adversarial perturbations in the acoustic domain, and building unified safety frameworks that jointly reason over text, audio, and prosody.
Authors
- Ye Yu
- Haibo Jin
- Yaoning Yu
- Jun Zhuang
- Haohan Wang
Paper Information
- arXiv ID: 2601.23255v1
- Categories: cs.CL, cs.AI, cs.CR
- Published: January 30, 2026
- PDF: Download PDF