[Paper] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Source: arXiv - 2511.21579v1
Overview
The paper “Harmony: Harmonizing Audio and Video Generation through Cross‑Task Synergy” tackles a core bottleneck in generative AI: creating audio‑visual content where sound and image stay tightly synchronized. By dissecting why current diffusion‑based models drift out of sync, the authors propose a suite of techniques that dramatically improve alignment without sacrificing visual or auditory quality.
Key Contributions
- Cross‑Task Synergy training – jointly trains audio‑driven video generation and video‑driven audio generation, using each modality as a strong supervisory signal for the other.
- Global‑Local Decoupled Interaction (GLDI) module – separates coarse global attention from fine‑grained local temporal interactions, enabling efficient and precise timing alignment.
- Synchronization‑Enhanced Classifier‑Free Guidance (SyncCFG) – modifies the standard CFG inference step to isolate and boost the cross‑modal alignment component.
- State‑of‑the‑art results – achieves higher fidelity and markedly better fine‑grained audio‑visual sync on benchmark datasets compared with prior open‑source methods.
Methodology
- Problem Diagnosis – The authors identify three failure modes in joint diffusion:
  - Correspondence Drift: noisy latent updates for audio and video diverge over time.
  - Inefficient Global Attention: standard transformers miss the subtle temporal cues needed for sync.
  - Intra‑modal CFG Bias: classic CFG strengthens conditional generation but ignores cross‑modal timing.
- Cross‑Task Synergy – Instead of training a single audio‑to‑video or video‑to‑audio model, Harmony alternates between the two tasks within the same diffusion framework. The output of one task (e.g., generated video) serves as a "ground‑truth" guide for the opposite task, anchoring the latent trajectories and reducing drift (a training‑loop sketch follows this list).
- GLDI Module – The diffusion backbone is split into two branches (see the interaction sketch after this list):
  - A global branch that captures overall scene context with a lightweight attention map.
  - A local branch that focuses on short‑range temporal windows, applying a specialized interaction layer that aligns audio waveforms with video frame sequences.
  This decoupling keeps computation tractable while preserving the fine timing needed for lip sync, footstep sounds, and similar effects.
- SyncCFG – During inference, the guidance term is decomposed into an alignment part and a content part. SyncCFG amplifies the alignment part, ensuring the model prioritizes keeping audio and video in lockstep while still respecting the original prompts (a guidance sketch follows this list).
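To make the alternating objective concrete, here is a minimal, hypothetical training‑loop sketch of the Cross‑Task Synergy idea: the audio→video and video→audio denoising tasks share one backbone, and the clean counterpart from the paired clip conditions each task. The module, shapes, noise schedule, and loss below are illustrative placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of cross-task synergy training: the two conditional
# denoising tasks (audio->video and video->audio) are alternated inside one
# loop, so the clean reference of the opposite modality anchors each task.
# Module names and shapes are illustrative, not the paper's actual code.
import torch
import torch.nn as nn

class CrossModalDenoiser(nn.Module):
    """Toy stand-in for a joint audio-video diffusion backbone."""
    def __init__(self, dim=64):
        super().__init__()
        self.video_head = nn.Linear(2 * dim, dim)
        self.audio_head = nn.Linear(2 * dim, dim)

    def forward(self, noisy, condition, task):
        h = torch.cat([noisy, condition], dim=-1)
        return self.video_head(h) if task == "a2v" else self.audio_head(h)

def synergy_step(model, opt, video_latent, audio_latent, sigma=0.5):
    """One alternating update: each modality is denoised while the clean
    counterpart from the paired clip serves as the anchoring condition."""
    losses = []
    for task, target, cond in [("a2v", video_latent, audio_latent),
                               ("v2a", audio_latent, video_latent)]:
        noisy = target + sigma * torch.randn_like(target)    # simplified forward diffusion
        pred = model(noisy, cond.detach(), task)              # opposite modality as supervisory anchor
        losses.append(nn.functional.mse_loss(pred, target))   # denoising objective
    loss = sum(losses)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = CrossModalDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
video, audio = torch.randn(4, 64), torch.randn(4, 64)  # paired latents for one batch
print(synergy_step(model, opt, video, audio))
```

Per the summary above, the anchor may also be the output generated by the opposite task rather than a ground‑truth latent; the sketch uses ground truth only to keep the loop short.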
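The GLDI module can be pictured as two attention paths of very different cost. Below is a hypothetical PyTorch sketch, assuming audio and video token sequences already resampled to a shared temporal grid; the pooling, window size, and residual wiring are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a global-local decoupled interaction: a cheap global
# cross-attention over a pooled audio summary for scene context, plus windowed
# local cross-attention that lets each short video segment attend only to the
# temporally co-located audio tokens. Dimensions and window size are illustrative.
import torch
import torch.nn as nn

class GlobalLocalInteraction(nn.Module):
    def __init__(self, dim=64, heads=4, window=4):
        super().__init__()
        self.window = window
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # Global branch: attend to a pooled (coarse) summary of the audio stream.
        audio_summary = audio_tokens.mean(dim=1, keepdim=True)
        global_out, _ = self.global_attn(video_tokens, audio_summary, audio_summary)

        # Local branch: split both streams into aligned temporal windows and
        # run cross-attention only within each window for precise timing.
        B, T, D = video_tokens.shape
        w = self.window
        v = video_tokens.reshape(B * (T // w), w, D)
        a = audio_tokens.reshape(B * (T // w), w, D)
        local_out, _ = self.local_attn(v, a, a)
        local_out = local_out.reshape(B, T, D)

        return video_tokens + global_out + local_out

x_video = torch.randn(2, 16, 64)   # (batch, frames, dim)
x_audio = torch.randn(2, 16, 64)   # audio tokens resampled to the frame rate
print(GlobalLocalInteraction()(x_video, x_audio).shape)  # torch.Size([2, 16, 64])
```

The design intuition, as described above, is that the global path stays lightweight because it attends to a compressed summary, while the local path pays full attention only inside short windows where precise timing actually matters.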
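For SyncCFG, the summary only states that the guidance is split into a content part and an alignment part, with the alignment part amplified. One plausible way to realize that split uses three denoiser calls, shown below; the exact decomposition and weights in the paper may differ, and `toy_denoiser` and its keyword arguments are placeholders rather than the authors' API.

```python
# Hypothetical sketch of synchronization-enhanced classifier-free guidance.
# Standard CFG uses a single conditional/unconditional difference; here the
# guidance is split into a content term (prompt vs. unconditional) and an
# alignment term (prompt + paired modality vs. prompt only), and the alignment
# term gets its own, larger weight. The decomposition is an assumption.
import torch

def sync_cfg(denoiser, x_t, t, prompt, other_modality,
             w_content=5.0, w_sync=9.0):
    eps_uncond = denoiser(x_t, t, prompt=None, cross_modal=None)
    eps_text = denoiser(x_t, t, prompt=prompt, cross_modal=None)
    eps_full = denoiser(x_t, t, prompt=prompt, cross_modal=other_modality)

    content_term = eps_text - eps_uncond     # what the prompt asks for
    alignment_term = eps_full - eps_text     # what the paired modality adds (timing)

    # Amplify the cross-modal alignment direction relative to plain CFG.
    return eps_uncond + w_content * content_term + w_sync * alignment_term

# Toy denoiser so the sketch runs end to end (a real model would be a DiT/UNet).
def toy_denoiser(x_t, t, prompt=None, cross_modal=None):
    out = x_t * 0.9
    if prompt is not None:
        out = out + 0.05
    if cross_modal is not None:
        out = out + 0.05 * cross_modal
    return out

x = torch.randn(1, 8)
print(sync_cfg(toy_denoiser, x, t=10, prompt="a dog barking",
               other_modality=torch.randn(1, 8)).shape)
```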
Results & Findings
- Quantitative Gains: Harmony improves SyncScore (a metric for fine‑grained temporal alignment) by roughly 30% over the previous best open‑source baseline, while also improving the FID and IS measures of visual quality.
- Qualitative Improvements: In user studies, participants rated Harmony‑generated clips as noticeably more “in sync” for challenging scenarios like fast‑talking speech, musical instrument playing, and dynamic action scenes.
- Efficiency: The GLDI module reduces attention‑related FLOPs by ~40 % compared to a full‑resolution transformer, enabling generation on a single RTX 4090 in under 8 seconds for a 5‑second clip.
Practical Implications
- Content Creation Pipelines: Video editors and game developers can now rely on a single open‑source model to generate both background music/sfx and matching visuals, cutting down on manual lip‑sync or Foley work.
- Interactive Media & VR: Real‑time avatars or virtual assistants that speak and gesture can maintain tight audio‑visual coherence, improving user immersion.
- Accessibility Tools: Automated captioning or sign‑language generation systems can benefit from synchronized audio‑visual outputs, making them more reliable for deaf or hard‑of‑hearing users.
- Rapid Prototyping: Start‑ups building AI‑driven advertising or social‑media content can integrate Harmony as a plug‑and‑play module, reducing the need for separate audio‑generation and video‑generation stacks.
Limitations & Future Work
- Domain Generalization: The model is trained on curated datasets (e.g., speech‑driven clips, musical performances). Performance may degrade on highly stylized or non‑naturalistic content (e.g., abstract animation).
- Long‑Form Consistency: While short clips (≤10 s) stay well‑aligned, maintaining sync over longer narratives still poses challenges.
- Hardware Requirements: Despite the GLDI efficiency gains, high‑quality generation still demands a modern GPU; lighter‑weight inference variants are an open research direction.
Future work could explore curriculum learning for longer sequences, domain‑adaptive fine‑tuning for niche media styles, and integration with text‑to‑speech models to create a fully end‑to‑end multimodal generation suite.
Authors
- Teng Hu
- Zhentao Yu
- Guozhen Zhang
- Zihan Su
- Zhengguang Zhou
- Youliang Zhang
- Yuan Zhou
- Qinglin Lu
- Ran Yi
Paper Information
- arXiv ID: 2511.21579v1
- Categories: cs.CV
- Published: November 26, 2025