[Paper] Audio Interaction Model

Published: June 3, 2026 at 01:26 PM EDT
Source: arXiv - 2606.05121v1

Overview

The paper introduces Audio‑Interaction, the first “always‑on” Large Audio Language Model (LALM) that can listen, understand, and respond in real time. By unifying offline audio tasks (e.g., transcription, classification) with streaming capabilities (e.g., live voice chat, proactive assistance), the authors move audio AI from batch‑mode processing to truly interactive applications.

Key Contributions

  • Audio Interaction Model (AIM) paradigm – formalizes a perceive‑decide‑respond loop for continuous, real‑time audio processing.
  • Audio‑Interaction system – a unified streaming LALM that retains offline task performance while adding online instruction following and proactive response generation.
  • SoundFlow framework – end‑to‑end pipeline covering streaming‑native data creation, comprehension‑aware training objectives, and asynchronous low‑latency inference.
  • StreamAudio‑2M dataset – a 2.6 M‑item corpus covering 7 core audio abilities (ASR, classification, detection, etc.) and 28 sub‑tasks, all designed for streaming scenarios.
  • Proactive‑Sound‑Bench – a benchmark suite that evaluates a model’s ability to intervene proactively (e.g., alerting a user to a dangerous sound).
  • Empirical validation – competitive results on eight established audio benchmarks plus new metrics demonstrating real‑time ASR, streaming instruction following, and proactive help.
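
The perceive‑decide‑respond loop named in the first contribution can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the `Decision` type, and the amplitude threshold are hypothetical stand‑ins, and a real system would use learned encoder, decision, and decoder networks rather than these hand‑written rules.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Decision:
    intent: str      # hypothetical stand-in for the latent "intent" vector
    speak_now: bool  # binary flag: respond now or keep listening


def perceive(frame: List[float]) -> float:
    """Toy 'encoder': reduce a frame of samples to its mean absolute amplitude."""
    if not frame:
        return 0.0
    return sum(abs(s) for s in frame) / len(frame)


def decide(embedding: float, threshold: float = 0.5) -> Decision:
    """Toy 'decision module': only loud frames set the speak-now flag."""
    if embedding >= threshold:
        return Decision(intent="loud_event", speak_now=True)
    return Decision(intent="background", speak_now=False)


def respond(decision: Decision, frame_index: int) -> str:
    """Toy 'decoder': emit a text response describing what was heard."""
    return f"frame {frame_index}: detected {decision.intent}"


def interaction_loop(stream: Iterable[List[float]]) -> Iterator[str]:
    """Run perceive -> decide on every frame; respond only when flagged."""
    for index, frame in enumerate(stream):
        decision = decide(perceive(frame))
        if decision.speak_now:
            yield respond(decision, index)


# Three short frames: quiet, loud, quiet.
frames = [[0.0, 0.1, -0.1], [0.9, -0.8, 0.7], [0.05, 0.0, 0.02]]
responses = list(interaction_loop(frames))
print(responses)
```

The point of the sketch is the control flow: the loop stays silent on most frames and produces output only when the decision step raises the flag, which is what separates an always‑on model from one that answers every input.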

Methodology

  1. Streaming‑native data construction – raw audio recordings are sliced into overlapping windows with timestamps, preserving temporal context and enabling the model to learn when to “listen” versus “speak.”
  2. Perceive‑Decide‑Respond loop
    • Perceive: a front‑end encoder continuously extracts frame‑level embeddings from the incoming audio stream.
    • Decide: a transformer‑based decision module consumes the embeddings together with any textual instruction, producing a latent “intent” vector and a binary “speak‑now” flag.
    • Respond: when the flag is set, a decoder generates the appropriate audio or text response (e.g., transcribed text, spoken reply).
  3. Comprehension‑aware training – loss functions combine standard task‑specific objectives (CTC for ASR, cross‑entropy for classification) with a response‑timing loss that penalizes premature or delayed replies, teaching the model to act at the right moment.
  4. Asynchronous low‑latency inference – the system runs perception and decision modules on separate threads, allowing the decoder to start generating output before the entire input segment finishes, achieving sub‑200 ms end‑to‑end latency on commodity GPUs.
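
Steps 1 and 4 can be sketched together as a small producer/consumer pipeline. This is a minimal illustration under stated assumptions, not the SoundFlow implementation: `slice_windows`, `run_pipeline`, the energy feature, and the threshold are all hypothetical, and the example uses an artificially low sample rate so the timestamps are easy to read. In CPython the two threads show the structure only; they do not by themselves deliver the latency reported in the paper.

```python
import queue
import threading
from typing import List, Tuple


def slice_windows(samples: List[float], window: int, hop: int,
                  sample_rate: int) -> List[Tuple[float, List[float]]]:
    """Cut audio into timestamped windows; they overlap whenever hop < window.

    Trailing samples that do not fill a complete hop are dropped.
    """
    if not samples:
        return []
    last_start = max(len(samples) - window, 0)
    return [(start / sample_rate, samples[start:start + window])
            for start in range(0, last_start + 1, hop)]


def run_pipeline(samples: List[float], window: int, hop: int,
                 sample_rate: int, threshold: float) -> List[float]:
    """Perception and decision run on separate threads joined by a queue."""
    embeddings: "queue.Queue" = queue.Queue()
    speak_times: List[float] = []
    done = object()  # sentinel telling the consumer the stream has ended

    def perception() -> None:
        try:
            for t, win in slice_windows(samples, window, hop, sample_rate):
                energy = sum(x * x for x in win) / len(win)  # toy embedding
                embeddings.put((t, energy))
        finally:
            embeddings.put(done)  # always unblock the decision thread

    def decision() -> None:
        while True:
            item = embeddings.get()
            if item is done:
                break
            t, energy = item
            if energy >= threshold:  # toy "speak-now" rule
                speak_times.append(t)

    workers = [threading.Thread(target=perception),
               threading.Thread(target=decision)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return speak_times


# 3 seconds of fake audio at 8 samples/s: silence, a loud burst, silence.
signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 8
windows = slice_windows(signal, window=8, hop=4, sample_rate=8)
speak_times = run_pipeline(signal, window=8, hop=4, sample_rate=8, threshold=0.5)
print(len(windows), speak_times)
```

Because the decision thread consumes each window as soon as it is queued, it can raise the flag before the perception thread has finished the rest of the stream, which is the property the asynchronous design relies on.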

Results & Findings

Benchmark | Offline LALM (baseline) | Audio‑Interaction (streaming)
--- | --- | ---
LibriSpeech (ASR) | 2.3 % WER | 2.4 % WER (no degradation)
AudioSet (classification) | 0.78 mAP | 0.77 mAP
Streaming ASR (real‑time) | - | 95 % word‑accuracy at 100 ms latency
Voice‑Chat (dialogue) | - | Human‑rated fluency 4.6/5
Proactive‑Sound‑Bench | 0.31 F1 | 0.68 F1 (new capability)
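
For readers unfamiliar with the WER figures above: word error rate is conventionally the word‑level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition. The function name and the example sentences are illustrative, and the paper's exact scoring pipeline (text normalization, tokenization) is not described here, so it may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

Lower is better, so a move from 2.3 % to 2.4 % corresponds to roughly one additional word error per thousand reference words.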

Key takeaways:

  • Negligible trade‑off on traditional offline tasks – the unified model stays within 0.1 WER points (LibriSpeech) and 0.01 mAP (AudioSet) of the offline baseline.
  • Real‑time performance – sub‑200 ms latency enables live transcription and immediate voice assistance.
  • Proactive behavior – the model can detect critical sounds (e.g., alarms) and intervene without explicit prompts, a capability absent from prior LALMs.

Practical Implications

  • Live assistants & smart speakers: devices can now listen continuously, understand user intent, and respond only when appropriate, reducing unnecessary interruptions.
  • Safety‑critical monitoring: factories, hospitals, or homes can deploy a single model that both logs audio events and actively warns users about hazardous sounds (e.g., smoke alarms, equipment failures).
  • Multimodal collaboration tools: developers can embed Audio‑Interaction into video‑conferencing platforms for on‑the‑fly captioning, language translation, and voice‑controlled UI actions.
  • Developer ergonomics: the SoundFlow pipeline provides ready‑made streaming data loaders and low‑latency inference wrappers, lowering the barrier to build custom real‑time audio applications.

Limitations & Future Work

  • Hardware dependence – achieving sub‑200 ms latency still requires a GPU; ultra‑low‑power edge devices may need model compression.
  • Scope of proactive tasks – the benchmark covers a limited set of safety sounds; broader environmental awareness (e.g., wildlife monitoring) remains unexplored.
  • Multilingual support – current experiments focus on English; extending the decision module to handle multilingual instruction following is an open direction.
  • Robustness to noisy streams – while the model tolerates moderate background noise, extreme acoustic conditions (e.g., reverberant rooms) can degrade timing decisions.

Audio‑Interaction opens the door to truly conversational, always‑on audio AI, and the accompanying SoundFlow ecosystem gives developers a practical path to bring these capabilities into production today.

Authors

  • Zhifei Xie
  • Zihang Liu
  • Ze An
  • Xiaobin Hu
  • Yue Liao
  • Ziyang Ma
  • Dongchao Yang
  • Mingbao Lin
  • Deheng Ye
  • Shuicheng Yan
  • Chunyan Miao

Paper Information

  • arXiv ID: 2606.05121v1
  • Categories: cs.SD, cs.AI, cs.CL, cs.MM, eess.AS
  • Published: June 3, 2026