[Paper] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
Source: arXiv - 2512.03036v1
Overview
The paper introduces ViSAudio, the first end-to-end system that generates binaural (left/right) spatial audio directly from a silent video clip. By training on the new, large-scale BiAudio dataset (≈97K video-binaural audio pairs), the authors show that it is possible to generate immersive sound that moves consistently with the camera and sound sources, something prior two-stage pipelines struggled to achieve.
Key Contributions
- New task definition: End‑to‑end video‑driven binaural audio generation, removing the error‑prone mono‑then‑spatialize pipeline.
- BiAudio dataset: 97 K real‑world video‑binaural audio pairs covering varied scenes, camera rotations, and source motions, built with a semi‑automated collection pipeline.
- ViSAudio architecture:
  - Dual-branch conditional flow-matching network that learns separate latent flows for the left and right ear.
  - Conditional spacetime module that enforces temporal coherence while preserving inter-aural differences.
- Comprehensive evaluation: State‑of‑the‑art performance on both objective metrics (e.g., SI‑SDR, ILD/ITD error) and subjective listening tests, showing superior spatial realism and audio quality.
Methodology
- Data preparation – The authors recorded binaural audio using a dummy head microphone while capturing synchronized video. A semi‑automated pipeline aligned the two modalities and filtered low‑quality samples, yielding the BiAudio corpus.
- Model design –
  - Dual-branch flow matcher: Instead of generating a single waveform and then spatializing it, ViSAudio directly predicts two latent trajectories (one per ear) conditioned on video frames. Flow matching learns a velocity field that transports a simple Gaussian prior toward the audio latent distribution, so a sample is produced by integrating that field over a few ODE steps rather than in a single jump.
  - Conditional spacetime module: Video features (appearance + motion) are injected into the flow network via cross-attention, keeping the generated left/right streams synchronized with visual cues (e.g., a moving car or a turning camera).
- Training – The system is optimized with the flow-matching regression loss on the predicted velocity field, plus auxiliary spatial-consistency losses that penalize mismatched inter-aural time/level differences (see the training sketch after this list).
- Inference – Given a silent clip, the model samples the left/right latent flows conditioned on the video and decodes them into waveforms, producing a ready-to-play binaural track (see the sampling sketch after this list).
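The paper's code is not reproduced in this summary; the following PyTorch sketch shows one way a dual-branch, video-conditioned flow-matching objective could be wired up. Every class, function, and tensor shape here (CrossAttentionBlock, DualBranchVelocityNet, the latent dimensions) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of a dual-branch, video-conditioned flow-matching training step.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Injects video features into an audio latent stream via cross-attention."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_tokens, video_tokens):
        x = audio_tokens + self.attn(audio_tokens, video_tokens, video_tokens)[0]
        return x + self.ff(self.norm(x))

class DualBranchVelocityNet(nn.Module):
    """Predicts one velocity field per ear, conditioned on video tokens and time t."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.left = CrossAttentionBlock(dim)
        self.right = CrossAttentionBlock(dim)
        self.out_left = nn.Linear(dim, dim)
        self.out_right = nn.Linear(dim, dim)

    def forward(self, z_t, video_tokens, t):
        # z_t: (B, 2, T, D) noisy latents for left/right ears; t: (B, 1)
        temb = self.time_mlp(t).unsqueeze(1)            # (B, 1, D)
        left = self.left(z_t[:, 0] + temb, video_tokens)
        right = self.right(z_t[:, 1] + temb, video_tokens)
        return torch.stack([self.out_left(left), self.out_right(right)], dim=1)

def flow_matching_loss(model, z1, video_tokens):
    """Standard flow-matching regression: predict the straight-path velocity z1 - z0."""
    z0 = torch.randn_like(z1)                           # Gaussian prior sample
    t = torch.rand(z1.shape[0], 1, device=z1.device)    # random time in [0, 1]
    t_exp = t.view(-1, 1, 1, 1)
    z_t = (1 - t_exp) * z0 + t_exp * z1                 # point on the interpolation path
    v_target = z1 - z0                                  # ground-truth velocity
    v_pred = model(z_t, video_tokens, t)
    return ((v_pred - v_target) ** 2).mean()

# Usage with dummy tensors: batch of 2 clips, 2 ears, 100 latent frames, width 256,
# conditioned on 50 video tokens of the same width.
model = DualBranchVelocityNet(dim=256)
z1 = torch.randn(2, 2, 100, 256)          # clean audio latents (e.g., from a pretrained codec)
video_tokens = torch.randn(2, 50, 256)    # video features from any visual encoder
loss = flow_matching_loss(model, z1, video_tokens)
loss.backward()
```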
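Inference then amounts to integrating the learned velocity field from Gaussian noise to audio latents and decoding each ear. Below is a minimal Euler-integration sketch reusing the hypothetical model and video tokens from the training sketch; `decode_latents_to_waveform` stands in for whatever latent decoder the paper actually uses.

```python
# Sketch of inference: integrate the learned velocity field from a Gaussian prior to
# audio latents with a few Euler steps, then decode each ear to a waveform.
# `model`, `video_tokens`, and `decode_latents_to_waveform` are assumptions carried
# over from the training sketch above, not the authors' released API.
import torch

@torch.no_grad()
def sample_binaural_latents(model, video_tokens, shape, n_steps: int = 25):
    z = torch.randn(shape)                      # start from the Gaussian prior
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0], 1), i * dt)   # current time for the whole batch
        v = model(z, video_tokens, t)           # video-conditioned velocity field
        z = z + dt * v                          # Euler step toward the data distribution
    return z                                    # (B, 2, T, D): left/right latents

# latents = sample_binaural_latents(model, video_tokens, shape=(1, 2, 100, 256))
# left_wav = decode_latents_to_waveform(latents[:, 0])   # hypothetical codec decoder
# right_wav = decode_latents_to_waveform(latents[:, 1])
```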
Results & Findings
| Metric | ViSAudio | Best Prior (Mono → Spatial) |
|---|---|---|
| SI‑SDR (dB) | 13.2 | 10.5 |
| ILD MAE (dB) | 1.8 | 3.4 |
| ITD MAE (ms) | 0.12 | 0.27 |
| MOS (Spatial Immersion) | 4.3 | 3.5 |
- Objective gains: Lower inter‑aural level and time‑difference errors indicate more accurate spatial cues.
- Subjective listening tests: Participants consistently rated ViSAudio’s audio as more immersive and better aligned with visual motion.
- Robustness: The model adapts to rapid camera rotations, moving sound sources, and diverse acoustic environments (indoor, outdoor, reverberant spaces) without noticeable artifacts.
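The summary does not spell out how the ILD/ITD errors in the table are computed. A common way to estimate these inter-aural cues, and hence the mean absolute errors, is sketched below; the function names and the energy-ratio/cross-correlation choices are assumptions, not necessarily the paper's exact protocol.

```python
# One common way to estimate the inter-aural cues behind the table above:
# ILD as the left/right energy ratio in dB, ITD as the lag of the cross-correlation
# peak. The exact metric code used in the paper is not specified in this summary.
import numpy as np

def ild_db(left: np.ndarray, right: np.ndarray, eps: float = 1e-8) -> float:
    """Inter-aural level difference in dB (positive = left louder)."""
    return 10.0 * np.log10((np.mean(left ** 2) + eps) / (np.mean(right ** 2) + eps))

def itd_seconds(left: np.ndarray, right: np.ndarray, sr: int) -> float:
    """Inter-aural time difference via the peak of the full cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)    # samples by which left lags right
    return lag / sr

def spatial_cue_errors(gen, ref, sr):
    """Per-clip absolute cue errors; gen/ref are (2, num_samples) [left, right] arrays."""
    ild_err = abs(ild_db(gen[0], gen[1]) - ild_db(ref[0], ref[1]))
    itd_err = abs(itd_seconds(gen[0], gen[1], sr) - itd_seconds(ref[0], ref[1], sr))
    return ild_err, itd_err  # dB, seconds
```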
Practical Implications
- VR/AR content creation – Developers can automatically generate realistic 3‑D soundscapes from existing video assets, cutting down on costly field recordings or manual ambisonic mixing.
- Game engines – Plug‑in style integration could let designers feed in character or camera animations and obtain synchronized binaural audio on the fly, enhancing player immersion.
- Accessibility – Binaural audio can improve spatial awareness for visually impaired users in multimedia applications, providing richer environmental cues.
- Remote collaboration & telepresence – Real‑time video streams could be enriched with spatial audio, making virtual meetings feel more “present” without extra microphone setups.
Limitations & Future Work
- Dataset bias – Although large, BiAudio is still dominated by certain scene types (e.g., outdoor streets, indoor rooms) and may not cover exotic acoustic conditions like large concert halls.
- Real‑time performance – The flow‑matching inference, while faster than two‑stage pipelines, still requires GPU acceleration; optimizing for edge devices remains an open challenge.
- Generalization to unseen microphone rigs – The model is trained on dummy‑head binaural recordings; adapting to other spatial audio formats (e.g., ambisonics) would need additional research.
- Future directions proposed by the authors include expanding the dataset with more diverse environments, exploring multi‑source separation within the binaural generation, and compressing the model for on‑device deployment.
Authors
- Mengchen Zhang
- Qi Chen
- Tong Wu
- Zihan Liu
- Dahua Lin
Paper Information
- arXiv ID: 2512.03036v1
- Categories: cs.CV, cs.AI
- Published: December 2, 2025
- PDF: https://arxiv.org/pdf/2512.03036v1