[Paper] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Published: December 2, 2025 at 01:56 PM EST
3 min read
Source: arXiv

Overview

The paper introduces ViSAudio, the first end‑to‑end system that creates binaural (left‑right) spatial audio directly from a silent video clip. By training on a new, large‑scale BiAudio dataset (≈97 K video‑binaural audio pairs), the authors demonstrate that it’s possible to generate immersive sound that moves consistently with the camera and sound sources—something prior two‑stage pipelines struggled to achieve.

Key Contributions

  • New task definition: End‑to‑end video‑driven binaural audio generation, removing the error‑prone mono‑then‑spatialize pipeline.
  • BiAudio dataset: 97 K real‑world video‑binaural audio pairs covering varied scenes, camera rotations, and source motions, built with a semi‑automated collection pipeline.
  • ViSAudio architecture:
    • Dual‑branch conditional flow‑matching network that learns separate latent flows for the left and right ear (a generic form of the flow‑matching objective is sketched after this list).
    • Conditional spacetime module that enforces temporal coherence while preserving inter‑aural differences.
  • Comprehensive evaluation: State‑of‑the‑art performance on both objective metrics (e.g., SI‑SDR, ILD/ITD error) and subjective listening tests, showing superior spatial realism and audio quality.
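
As a point of reference for the dual‑branch flow‑matching contribution above, the block below sketches a generic conditional flow‑matching objective with linear interpolation paths. This is a standard textbook formulation, not necessarily the paper's exact loss; the video conditioning c and the per‑ear application are assumptions about how ViSAudio would use it.

```latex
% Generic conditional flow-matching objective (sketch; the paper's exact loss may differ).
% v_theta: learned velocity field, c: video conditioning, x_0: Gaussian noise, x_1: audio latent.
\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{audio}}}
      \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2,
  \qquad x_t = (1 - t)\, x_0 + t\, x_1 .
% In a dual-branch setting, x_1 would hold the left/right ear latents, with c shared across both branches.
```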

Methodology

  1. Data preparation – The authors recorded binaural audio using a dummy head microphone while capturing synchronized video. A semi‑automated pipeline aligned the two modalities and filtered low‑quality samples, yielding the BiAudio corpus.
  2. Model design
    • Dual‑branch flow matcher: Instead of generating a single waveform and then spatializing it, ViSAudio directly predicts two latent trajectories (one per ear) conditioned on video frames. Flow matching learns a velocity field that transports a simple Gaussian prior toward the complex audio distribution (the generic objective sketched above illustrates the idea).
    • Conditional spacetime module: Video features (appearance + motion) are injected into the flow network via cross‑attention, ensuring that the generated left/right streams stay synchronized with visual cues (e.g., a moving car or a turning camera); a minimal sketch of this conditioning appears after this list.
  3. Training – The system is optimized with the flow‑matching objective combined with auxiliary spatial‑consistency losses that penalize mismatched inter‑aural time and level differences (ITD/ILD).
  4. Inference – Given a silent clip, the model samples the left/right latent flows conditioned on the video, then decodes them into waveforms, producing a ready‑to‑play binaural audio track.
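
To make the cross‑attention conditioning of the conditional spacetime module (step 2) concrete, here is a minimal PyTorch sketch of audio latent tokens attending to video feature tokens. The module name, dimensions, and layout are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VideoCrossAttention(nn.Module):
    """Minimal sketch of injecting video features into audio latents via cross-attention
    (an illustrative stand-in for the paper's conditional spacetime module, not its code)."""

    def __init__(self, audio_dim: int = 256, video_dim: int = 512, heads: int = 8):
        super().__init__()
        self.to_ctx = nn.Linear(video_dim, audio_dim)  # project video features into the audio latent space
        self.attn = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, T_audio, audio_dim); video_tokens: (batch, T_video, video_dim)
        ctx = self.to_ctx(video_tokens)
        out, _ = self.attn(audio_tokens, ctx, ctx)  # audio queries attend to video keys/values
        return self.norm(audio_tokens + out)        # residual + norm keeps the audio stream intact
```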
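
For step 4 (inference), the sketch below shows how a trained velocity field could be integrated with a simple Euler solver to sample left/right latents from noise and decode them into waveforms. `velocity_net`, `decoder`, and the two‑channel latent layout are hypothetical placeholders rather than ViSAudio's actual API, and practical systems often use more refined ODE solvers.

```python
import torch

def sample_binaural(velocity_net, decoder, video_feats, latent_shape, steps: int = 50):
    """Euler-integration sketch of flow-matching inference for a two-ear latent (illustrative only)."""
    # Start both ear branches from Gaussian noise: (batch, 2, ...latent dims).
    x = torch.randn(latent_shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        # The network predicts velocities for the left/right latents jointly,
        # conditioned on video features (e.g., via cross-attention inside the model).
        v = velocity_net(x, t, video_feats)
        x = x + dt * v  # Euler step along the learned flow
    # Decode the two latent trajectories into left/right waveforms.
    left, right = decoder(x[:, 0]), decoder(x[:, 1])
    return torch.stack([left, right], dim=1)  # (batch, 2, samples)
```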

Results & Findings

| Metric | ViSAudio | Best Prior (Mono → Spatial) |
|---|---|---|
| SI‑SDR (dB) | 13.2 | 10.5 |
| ILD MAE (dB) | 1.8 | 3.4 |
| ITD MAE (ms) | 0.12 | 0.27 |
| MOS (Spatial Immersion) | 4.3 | 3.5 |

  • Objective gains: Lower inter‑aural level‑difference (ILD) and time‑difference (ITD) errors indicate more accurate spatial cues (a rough way to compute these cues is sketched after this list).
  • Subjective listening tests: Participants consistently rated ViSAudio’s audio as more immersive and better aligned with visual motion.
  • Robustness: The model adapts to rapid camera rotations, moving sound sources, and diverse acoustic environments (indoor, outdoor, reverberant spaces) without noticeable artifacts.
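
For readers unfamiliar with the inter‑aural metrics in the table, the sketch below shows one rough way to compute ILD (an RMS level ratio in dB) and ITD (a cross‑correlation lag in milliseconds) from a stereo pair. The function names and the broadband, whole‑signal analysis are simplifying assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def ild_db(left: np.ndarray, right: np.ndarray, eps: float = 1e-8) -> float:
    """Inter-aural level difference in dB: RMS level ratio of the two channels."""
    rms_l = np.sqrt(np.mean(left ** 2)) + eps
    rms_r = np.sqrt(np.mean(right ** 2)) + eps
    return 20.0 * float(np.log10(rms_l / rms_r))

def itd_ms(left: np.ndarray, right: np.ndarray, sr: int = 48_000) -> float:
    """Inter-aural time difference in ms via the broadband cross-correlation peak."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)  # positive lag: left channel lags the right
    return 1000.0 * lag / sr

# Hypothetical usage: mean absolute error of each cue against reference recordings.
# ild_mae = np.mean([abs(ild_db(*gen) - ild_db(*ref)) for gen, ref in pairs])
# itd_mae = np.mean([abs(itd_ms(*gen) - itd_ms(*ref)) for gen, ref in pairs])
```

In practice, ITD is usually estimated on low‑frequency bands or with methods such as GCC‑PHAT, and both cues are typically compared frame by frame against reference recordings rather than over the whole clip.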

Practical Implications

  • VR/AR content creation – Developers can automatically generate realistic 3‑D soundscapes from existing video assets, cutting down on costly field recordings or manual ambisonic mixing.
  • Game engines – Plug‑in style integration could let designers feed in character or camera animations and obtain synchronized binaural audio on the fly, enhancing player immersion.
  • Accessibility – Binaural audio can improve spatial awareness for visually impaired users in multimedia applications, providing richer environmental cues.
  • Remote collaboration & telepresence – Real‑time video streams could be enriched with spatial audio, making virtual meetings feel more “present” without extra microphone setups.

Limitations & Future Work

  • Dataset bias – Although large, BiAudio is still dominated by certain scene types (e.g., outdoor streets, indoor rooms) and may not cover exotic acoustic conditions like large concert halls.
  • Real‑time performance – The flow‑matching inference, while faster than two‑stage pipelines, still requires GPU acceleration; optimizing for edge devices remains an open challenge.
  • Generalization to unseen microphone rigs – The model is trained on dummy‑head binaural recordings; adapting to other spatial audio formats (e.g., ambisonics) would need additional research.
  • Future directions proposed by the authors include expanding the dataset with more diverse environments, exploring multi‑source separation within the binaural generation, and compressing the model for on‑device deployment.

Authors

  • Mengchen Zhang
  • Qi Chen
  • Tong Wu
  • Zihan Liu
  • Dahua Lin

Paper Information

  • arXiv ID: 2512.03036v1
  • Categories: cs.CV, cs.AI
  • Published: December 2, 2025