[Paper] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
Source: arXiv - 2512.23646v1
Overview
OmniAgent is an “audio‑guided” AI agent that actively decides what to look at in a video, and when, using sound as the primary cue. By turning perception into a dynamic, tool‑driven process rather than a static, frame‑by‑frame pass, the system supports finer‑grained audio‑visual reasoning and advances the state of the art on several benchmarks.
Key Contributions
- Audio‑first active perception – Introduces a coarse‑to‑fine pipeline where short audio snippets first locate the relevant temporal segment, and visual analysis is then focused only on that region (a minimal sketch of this step follows the list).
- Tool orchestration framework – Implements a planner that dynamically selects and invokes specialized perception modules (e.g., object detector, action recognizer) on demand, rather than running a monolithic model over the whole video.
- Dynamic, query‑driven workflow – Moves away from static, dense captioning pipelines to a “think‑and‑act” loop that asks follow‑up questions and gathers additional evidence only when needed.
- Strong empirical gains – Outperforms leading open‑source and commercial multimodal models by 10.6 to 18.5 absolute percentage points on three diverse audio‑video understanding benchmarks.
- Open‑source friendly design – Built on publicly available LLM back‑ends and modular perception tools, making it easy to extend or replace components.
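A minimal sketch of the audio‑first localization idea from the first contribution: per‑hop scores from any lightweight audio event tagger are thresholded into one rough time window for later visual inspection. All names, thresholds, and padding values here are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class TimeWindow:
    start_s: float  # window start, in seconds
    end_s: float    # window end, in seconds

def coarse_localize(event_scores: list[float], hop_s: float,
                    threshold: float = 0.5, pad_s: float = 1.0) -> TimeWindow | None:
    """Turn per-hop audio event scores into one rough time window that the
    visual tools will later inspect in detail."""
    hits = [i for i, score in enumerate(event_scores) if score >= threshold]
    if not hits:
        return None  # no salient sound; caller can fall back to sparse uniform sampling
    start = max(0.0, hits[0] * hop_s - pad_s)
    end = (hits[-1] + 1) * hop_s + pad_s
    return TimeWindow(start_s=start, end_s=end)

# Example: a 10 s clip scored at 1 s hops, with a salient sound around seconds 4-6.
window = coarse_localize([0.1, 0.2, 0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1], hop_s=1.0)
# -> TimeWindow(start_s=3.0, end_s=7.0)
```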
Methodology
- Coarse Audio Localization – The agent first runs a lightweight audio encoder on the entire clip to detect salient sound events (e.g., a dog bark, a musical note). This step produces a rough time window where the interesting action is likely happening.
- Planner & Tool Selector – A language‑model‑based planner receives the audio cue and the current task description (e.g., “What caused the loud crash?”). It decides which perception tool to call next (e.g., a face detector, a pose estimator, or a scene classifier) and formulates a precise query for that tool (see the loop sketched after this list).
- Fine‑grained Visual Inspection – The chosen tool processes only the frames inside the audio‑identified window, drastically reducing compute while preserving detail.
- Iterative Reasoning Loop – The LLM integrates the tool’s output, updates its internal state, and may request additional tools (e.g., ask for optical flow if motion is ambiguous). The loop stops once the answer confidence crosses a threshold.
- Answer Generation – Finally, the LLM synthesizes a natural‑language response that combines audio evidence, visual detections, and any higher‑level reasoning.
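The five steps above can be read as a single think‑and‑act loop. The sketch below (referenced from the Planner & Tool Selector step) is a schematic reconstruction that assumes a tool registry keyed by name and a planner object exposing hypothetical plan and generate_answer methods; it is not the authors' code.

```python
from typing import Any, Callable

Tool = Callable[[dict], dict]  # every perception tool takes a query dict and returns evidence

def answer_question(question: str, clip: Any, tools: dict[str, Tool],
                    planner: Any, max_steps: int = 5, conf_threshold: float = 0.8) -> str:
    """Coarse audio cue -> planner picks tools -> gather evidence on the cued
    frames -> stop once the planner is confident -> generate the answer."""
    window = tools["audio_localizer"]({"clip": clip})          # step 1: coarse audio localization
    state = {"question": question, "window": window, "evidence": []}
    for _ in range(max_steps):
        decision = planner.plan(state)                         # step 2: planner chooses the next tool + query
        if decision["action"] == "answer" or decision["confidence"] >= conf_threshold:
            break                                              # enough evidence; stop calling tools
        evidence = tools[decision["tool"]](decision["query"])  # step 3: inspect only the cued frames
        state["evidence"].append(evidence)                     # step 4: update state and iterate
    return planner.generate_answer(state)                      # step 5: synthesize the final answer
```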
The whole pipeline is end‑to‑end trainable via reinforcement‑style rewards that encourage minimal tool usage while maximizing answer accuracy.
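One way such a reward could be shaped, shown only as an assumption about its form (the paper may weight correctness and tool cost differently):

```python
def episode_reward(answer_correct: bool, num_tool_calls: int,
                   accuracy_bonus: float = 1.0, call_cost: float = 0.05) -> float:
    """Reward a correct answer and charge a small cost per tool call, so
    training favors gathering just enough evidence before answering."""
    return (accuracy_bonus if answer_correct else 0.0) - call_cost * num_tool_calls
```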
Results & Findings
| Benchmark | Prior SOTA | OmniAgent | Δ (absolute points) |
|---|---|---|---|
| AVQA (Audio‑Visual Question Answering) | 68.3 % | 78.9 % | +10.6 |
| VGGSound‑Action | 71.5 % | 84.2 % | +12.7 |
| MUSIC‑Video (multimodal retrieval) | 73.0 % | 91.5 % | +18.5 |
- Efficiency: Because visual tools run on a fraction (≈15 %) of the total frames, inference time drops by ~30 % compared with dense captioning baselines.
- Robustness to Noise: Audio‑first localization helps the system ignore irrelevant visual clutter, leading to higher accuracy on videos with busy backgrounds.
- Generalization: The modular tool set allows OmniAgent to adapt to new tasks (e.g., sound‑source separation) with minimal retraining.
Practical Implications
- Developer‑friendly APIs: The tool‑orchestration layer can be exposed as a simple “ask‑question” endpoint; developers can plug in custom detectors (e.g., a proprietary defect‑recognition model) without touching the core LLM (see the sketch after this list).
- Cost‑effective video analytics: Media platforms can run OmniAgent to tag or moderate user‑generated content, focusing compute only where audio indicates something noteworthy (e.g., violent sounds, emergency alarms).
- Enhanced assistive tech: Wearable devices for the hearing impaired could use the audio‑first approach to surface visual context only when a salient sound occurs, preserving battery life.
- Improved multimodal search: E‑commerce sites can let users search “show me videos where a glass breaks” and rely on the audio cue to quickly surface relevant clips, improving user experience.
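To illustrate the plug‑in idea from the first implication above, the snippet below registers a custom detector against the generic tool interface used in the Methodology sketch; the function and registry names are hypothetical and not part of any released API.

```python
def defect_detector(query: dict) -> dict:
    """Wrap a proprietary defect-recognition model in the generic tool interface
    (the model call is left as a placeholder)."""
    frames = query.get("frames", [])
    # ...run the in-house model on `frames` here and return structured evidence...
    return {"defects": [], "frames_checked": len(frames)}

tools = {
    "audio_localizer": lambda q: {"start_s": 0.0, "end_s": 2.0},  # stub for illustration
    "defect_detector": defect_detector,
}
# answer_question(question, clip, tools, planner) from the Methodology sketch then
# serves as the single "ask-question" entry point over this registry.
```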
Limitations & Future Work
- Audio quality dependence: In noisy environments or with low‑fidelity recordings, the initial audio cue may mislocalize events, leading to missed visual evidence.
- Tool selection overhead: The planner’s decision process adds latency in edge‑deployment scenarios; lightweight alternatives are needed for real‑time use.
- Domain transfer: While the modular design eases adaptation, the current set of tools is tuned for generic objects and actions; specialized domains (e.g., medical imaging) will require new tool training.
- Future directions include integrating visual‑first fallback strategies, expanding the tool library (e.g., 3D pose, depth estimation), and exploring self‑supervised audio‑visual alignment to reduce reliance on labeled data.
Authors
- Keda Tao
- Wenjie Du
- Bohan Yu
- Weiqiang Wang
- Jian Liu
- Huan Wang
Paper Information
- arXiv ID: 2512.23646v1
- Categories: cs.CV
- Published: December 29, 2025