[Paper] Benchmarking Single-Factor Physical Video-to-Audio Generation

Published: May 28, 2026 at 01:59 PM EDT

Source: arXiv - 2605.30339v1

Overview

The paper Benchmarking Single-Factor Physical Video-to-Audio Generation introduces FlatSounds, an evaluation suite that tests whether video‑to‑audio (V2A) models capture the physics behind a scene rather than merely producing plausible‑sounding audio. By changing one physical factor at a time (e.g., object material, impact force) and checking how the generated audio responds, the authors expose a trade‑off in current state‑of‑the‑art systems: they lean heavily on textual captions for semantics and often miss the timing and physical cues in the visual stream.

Key Contributions

FlatSounds benchmark: a curated collection of counterfactual video pairs and single‑video pattern tests that isolate a single physical variable per experiment.
Physics‑aware evaluation metrics: quantitative scores for physical correctness, temporal alignment, and semantic consistency that correlate with human preference judgments.
Comprehensive audit of SOTA V2A models (e.g., AudioLDM, V2A‑GAN, DiffWave‑V2A), revealing a consistent reliance on captions over visual cues.
Analysis of the caption‑vs‑visual trade‑off, showing that captions improve semantic and physical accuracy but degrade temporal synchronization.
Open‑source release of the dataset, evaluation code, and a project webpage for reproducibility and community benchmarking.

Methodology

Controlled Counterfactual Pairs – For each base video, the authors generate a partner video that differs in exactly one physical factor (e.g., swapping a wooden block for a metal one while keeping the motion identical).
Single‑Video Pattern Tests – A single video is presented with multiple, systematically varied captions (or visual perturbations) to probe whether the model’s audio output follows expected directional trends (e.g., louder sound for higher impact velocity).
Metrics –
- Physical Consistency: compares acoustic features (spectral centroid, onset strength) against ground‑truth physics parameters.
- Temporal Alignment: measures lag/lead between visual events (e.g., a ball hitting the floor) and audio onsets using dynamic time warping.
- Semantic Accuracy: evaluates whether the generated sound matches the caption’s object class using a pretrained audio classifier.
Human Validation – A subset of the benchmark is run through crowd‑sourced preference tests; correlation analysis confirms that the physics‑based metrics predict human judgments.
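The temporal alignment metric above can be illustrated with a small dynamic time warping sketch. This is not the benchmark's released code: the function names, the absolute-difference cost, and the use of per-frame activity envelopes as input are all assumptions made for illustration.

```python
def dtw_path(a, b):
    """Classic dynamic time warping with an absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path


def event_lag_seconds(path, event_frame, fps):
    """Mean offset (audio frame minus visual frame) at one visual event, in seconds."""
    offsets = [j - i for i, j in path if i == event_frame]
    return sum(offsets) / len(offsets) / fps


# Toy envelopes (hypothetical data): the impact is visible at frame 3
# but the generated audio onset arrives at frame 5.
visual = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
distance, path = dtw_path(visual, audio)
lag = event_lag_seconds(path, event_frame=3, fps=10)
```

A positive lag means the audio onset trails the visual event. In practice the envelopes would come from a motion or contact detector on the video and an onset-strength curve on the audio; both are outside the scope of this sketch.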

The pipeline is deliberately simple enough for developers to plug in any V2A model and obtain a detailed “physics audit” without needing a deep background in acoustics or physics simulation.
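As a rough picture of what plugging a model into such an audit might look like, the sketch below runs one counterfactual pair through a stand-in model and compares spectral centroids. Everything here is hypothetical: `toy_model`, the `pair` dictionary layout, and the rule that the metal variant should have the higher centroid are illustrative assumptions, not the FlatSounds API or its scoring rule.

```python
import cmath
import math


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency, via a naive DFT (fine for short clips)."""
    n = len(samples)
    weighted = total = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        mag = abs(coeff)
        weighted += (k * sample_rate / n) * mag
        total += mag
    return weighted / total if total else 0.0


def counterfactual_pass(model, pair, sample_rate):
    """True if the generated audio moves in the expected direction across the pair."""
    low = spectral_centroid(model(pair["expected_lower"]), sample_rate)
    high = spectral_centroid(model(pair["expected_higher"]), sample_rate)
    return high > low


def toy_model(video):
    # Stand-in for a real V2A model: emits a pure tone chosen by the material tag.
    freq = {"wood": 500.0, "metal": 2000.0}[video["material"]]
    return [math.sin(2 * math.pi * freq * t / 8000) for t in range(256)]


# Assumed expectation for this example: metal sounds brighter than wood.
pair = {"expected_lower": {"material": "wood"}, "expected_higher": {"material": "metal"}}
passed = counterfactual_pass(toy_model, pair, sample_rate=8000)
wood_centroid = spectral_centroid(toy_model({"material": "wood"}), 8000)
```

Any callable that maps a video description to a list of audio samples could replace `toy_model`; a real audit would swap the naive DFT for an FFT library and aggregate pass rates over many pairs.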

Results & Findings

Caption Dominance: Adding a textual description boosts semantic and physical correctness by ~12% on average, but temporal alignment drops by ~8%, indicating models treat captions as a shortcut rather than grounding sound in visual motion.
Visual Stream Under‑utilized: Even when captions are omitted, models struggle to capture fine‑grained physical variations (e.g., material changes); accuracy on the counterfactual test falls below 55% for most SOTA systems.
Trade‑off Curve: Plotting physical consistency vs. temporal alignment reveals a clear Pareto frontier; improving one metric typically harms the other under current architectures.
Metric‑Human Correlation: Pearson r = 0.78 between the composite physics score and human preference rankings, confirming the benchmark’s relevance to real‑world perception.
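The last two findings rest on standard computations, sketched below for readers who want to reproduce them on their own scores. The numbers are made up for illustration and are not the paper's data; the function names are also assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def pareto_front(points):
    """Points not dominated by any other point (higher is better on both axes)."""
    return [
        p for p in points
        if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
    ]


# Hypothetical (physical consistency, temporal alignment) scores for five models.
scores = [(0.9, 0.3), (0.7, 0.6), (0.4, 0.8), (0.5, 0.5), (0.3, 0.2)]
front = pareto_front(scores)

# Hypothetical metric scores vs. human preference ranks for four clips.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Models on the front cannot be improved on one axis without losing on the other, which is the pattern the trade-off curve describes.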

Overall, the study shows that current V2A research is still “audio‑first”: models excel at producing plausible sounds but lack a robust understanding of the underlying physics that generated them.

Practical Implications

Game & VR Audio Engines: Developers can use FlatSounds to verify that procedural sound generators react correctly to physics changes (e.g., different surface materials), leading to more immersive experiences.
Robotics & Simulation: When training agents that rely on audio cues (e.g., for fault detection), a physics‑aware V2A model can provide realistic synthetic sound that respects cause‑effect relationships.
Multimodal Content Creation: Tools that auto‑generate soundtracks for video editors can now be benchmarked for temporal fidelity, reducing the need for manual post‑editing to fix misaligned audio spikes.
Model Design Guidance: The revealed caption‑vs‑visual trade‑off suggests that future architectures should incorporate tighter visual‑audio cross‑attention or explicit physics priors (e.g., differentiable simulators) rather than leaning on text prompts.
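For the game-audio use case, a directional check in the spirit of the single-video pattern tests could look like the sketch below: vary one factor, measure loudness, and count how often the ordering matches. The generator `toy_impact_sound`, the choice of RMS as the loudness measure, and the pairwise agreement score are assumptions for illustration, not part of FlatSounds.

```python
import math
from itertools import combinations


def rms(samples):
    """Root-mean-square amplitude, used here as a simple loudness proxy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def trend_agreement(factor_values, responses):
    """Fraction of pairs whose response ordering matches the factor ordering."""
    concordant = total = 0
    for (f1, r1), (f2, r2) in combinations(zip(factor_values, responses), 2):
        if f1 == f2:
            continue  # ties in the factor carry no directional information
        total += 1
        if (f2 - f1) * (r2 - r1) > 0:
            concordant += 1
    return concordant / total if total else float("nan")


def toy_impact_sound(velocity):
    # Hypothetical procedural generator: amplitude grows with impact velocity.
    return [0.1 * velocity * math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]


velocities = [1.0, 2.0, 3.0, 4.0]
loudness = [rms(toy_impact_sound(v)) for v in velocities]
agreement = trend_agreement(velocities, loudness)
```

A score of 1.0 means every faster impact produced a louder clip; a score near 0.5 would suggest the generator ignores the factor.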

By offering a concrete, reproducible way to test “does the sound make sense physically?”, FlatSounds pushes the community toward models that are not just pretty‑sounding but also trustworthy for downstream applications where timing and causality matter.

Limitations & Future Work

Scope of Physical Factors: The benchmark currently covers a limited set of factors (material, impact force, object size). Extending to fluid dynamics, friction, or multi‑object interactions would broaden its relevance.
Dataset Size & Diversity: FlatSounds uses synthetic, controlled videos; real‑world footage with noisy lighting or occlusions may expose additional challenges.
Model‑Specific Bias: The study focuses on a handful of publicly available V2A models; custom or proprietary systems could behave differently.
Future Directions: The authors propose integrating differentiable physics engines into the training loop, exploring self‑supervised physical consistency losses, and expanding the benchmark to multi‑modal tasks (e.g., video‑to‑audio‑to‑text).

If you’re building audio‑centric AI products, give FlatSounds a spin. It’s a practical sanity‑check that can surface hidden physics bugs before they become costly post‑production fixes.

Authors

Tingle Li
Siddharth Gururani
Kevin J. Shih
Gantavya Bhatt
Sang-gil Lee
Zhifeng Kong
Arushi Goel
Gopala Anumanchipalli
Ming-Yu Liu

Paper Information

arXiv ID: 2605.30339v1
Categories: cs.CV, cs.MM, cs.SD, eess.AS
Published: May 28, 2026
