[Paper] JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
Source: arXiv - 2601.22143v1
Overview
The paper “JUST‑DUB‑IT: Video Dubbing via Joint Audio‑Visual Diffusion” shows how a single, foundation‑level audio‑visual diffusion model can be turned into a practical video‑dubbing engine. By fine‑tuning the model with a lightweight LoRA (Low‑Rank Adaptation), the authors generate translated speech and realistic lip‑sync for the original speaker in a single pass, without the cumbersome, multi‑stage pipelines that dominate current dubbing tools.
Key Contributions
- Unified dubbing model – Adapts a pretrained audio‑visual diffusion model to perform translation, speech synthesis, and facial motion generation in one pass.
- LoRA‑based conditioning – Introduces a small, trainable LoRA that lets the model ingest an existing audio‑visual clip and output a dubbed version while preserving identity.
- Synthetic multilingual training data – Uses the diffusion model itself to create paired multilingual video clips (language switches within a single clip) and then inpaints each half, eliminating the need for costly manually labeled dubbing datasets.
- Robustness to real‑world dynamics – Demonstrates high‑fidelity lip synchronization even with complex head motion, lighting changes, and background activity.
- Quantitative and perceptual gains – Shows measurable improvements over state‑of‑the‑art dubbing pipelines in visual fidelity, sync accuracy, and overall video quality.
Methodology
- Base Model – Starts from a large audio‑visual diffusion model pretrained to jointly generate sound and video frames.
- LoRA Fine‑Tuning – Adds a low‑rank adapter (LoRA) to the model’s cross‑modal attention layers. This adapter learns to condition generation on an input video‑audio pair while still leveraging the strong generative prior of the base model (a minimal sketch of this injection appears after this list).
- Synthetic Paired Data Generation
- The base diffusion model creates a multilingual version of a source clip by swapping the spoken language mid‑clip.
- Each half of the clip is then inpainted: the audio is replaced with the target language, and the face region is regenerated to match the new speech.
- The result is a paired dataset of “original ↔ dubbed” videos for the same speaker, automatically generated at scale (a sketch of this recipe also appears after this list).
- Training Loop – The LoRA is trained on these synthetic pairs to learn the mapping from source audio‑visual content to the dubbed output, preserving speaker identity and motion cues.
- Inference – At test time, a user supplies a video and a target language transcript. The LoRA‑augmented diffusion model produces a new audio track and a synchronized facial animation in a single forward pass.
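The sketch below illustrates the kind of LoRA injection described in the fine‑tuning step above: a small low‑rank adapter wrapped around the projections of a cross‑modal attention block while the pretrained weights stay frozen. It is a minimal PyTorch illustration under assumed dimensions and layer names (`LoRALinear`, `CrossModalAttention`), not the authors' architecture.

```python
# Minimal sketch (not the authors' code): a low-rank adapter around the query and
# key/value projections of a cross-modal attention block. Ranks, dimensions, and
# which layers receive LoRA are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # update starts at zero, so output = base output
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class CrossModalAttention(nn.Module):
    """Toy stand-in for one audio-to-video attention block of the base model."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad = False          # base attention weights also stay frozen
        self.to_q = LoRALinear(nn.Linear(dim, dim))   # LoRA on the query projection
        self.to_kv = LoRALinear(nn.Linear(dim, dim))  # LoRA on the key/value projection

    def forward(self, video_tokens, audio_tokens):
        q = self.to_q(video_tokens)
        kv = self.to_kv(audio_tokens)
        out, _ = self.attn(q, kv, kv)
        return video_tokens + out            # residual connection

# Only the LoRA parameters would be optimized during dubbing fine-tuning.
block = CrossModalAttention()
trainable = [p for p in block.parameters() if p.requires_grad]
video = torch.randn(2, 16, 512)   # (batch, video tokens, dim)
audio = torch.randn(2, 32, 512)   # (batch, audio tokens, dim)
print(block(video, audio).shape, sum(p.numel() for p in trainable))
```

Only the LoRA matrices, a small fraction of the total parameters, would be trained on the synthetic pairs; at inference the same frozen backbone plus adapter runs once, conditioned on the source clip and the target‑language transcript.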
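The next sketch outlines the synthetic paired‑data recipe described above. The function names (`generate_language_switch_clip`, `inpaint_dub`, `make_pairs`) and the exact pairing logic are an interpretation for illustration, with placeholders standing in for the base diffusion model's generation and inpainting calls.

```python
# A minimal sketch of the synthetic paired-data recipe; placeholder functions
# stand in for the pretrained audio-visual diffusion model.
from dataclasses import dataclass
import numpy as np


@dataclass
class Clip:
    frames: np.ndarray   # (T, H, W, 3) video frames
    audio: np.ndarray    # (S,) mono waveform


def generate_language_switch_clip(prompt: str, lang_a: str, lang_b: str) -> Clip:
    # Placeholder: the base model generates one speaker whose speech switches
    # from lang_a to lang_b at the midpoint of the clip.
    return Clip(np.zeros((32, 64, 64, 3), np.uint8), np.zeros(32000, np.float32))


def inpaint_dub(clip: Clip, target_lang: str) -> Clip:
    # Placeholder: replace the speech with target_lang and regenerate the mouth
    # region so the lips match the new audio, keeping identity and head motion.
    return Clip(clip.frames.copy(), clip.audio.copy())


def split_halves(clip: Clip) -> tuple[Clip, Clip]:
    t, s = clip.frames.shape[0] // 2, clip.audio.shape[0] // 2
    return (Clip(clip.frames[:t], clip.audio[:s]),
            Clip(clip.frames[t:], clip.audio[s:]))


def make_pairs(prompt: str, lang_a: str, lang_b: str) -> list[tuple[Clip, Clip]]:
    """Each half keeps its original speech and also gets an inpainted version in the
    other language, giving two (original, dubbed) pairs of the same speaker."""
    first, second = split_halves(generate_language_switch_clip(prompt, lang_a, lang_b))
    return [(first, inpaint_dub(first, target_lang=lang_b)),
            (second, inpaint_dub(second, target_lang=lang_a))]


dataset = [pair for prompt in ("a chef narrating a recipe", "a news anchor")
           for pair in make_pairs(prompt, "en", "es")]
print(f"{len(dataset)} synthetic (original, dubbed) pairs")
```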
Results & Findings
- Lip‑Sync Accuracy – Achieves a 23 % reduction in lip‑sync error (measured by LSE‑C) compared to the best open‑source dubbing pipeline.
- Visual Fidelity – Improves structural similarity (SSIM) by 0.07 on challenging, fast‑moving clips, indicating fewer artifacts in the regenerated face region.
- Speaker Identity Preservation – Identity similarity scores (using a face‑recognition encoder) remain >0.92, showing the model does not drift toward a generic “talking head” (a minimal metric‑computation sketch follows this list).
- Robustness Tests – Maintains high sync and visual quality across diverse settings: outdoor lighting, occlusions, and rapid head turns, where traditional methods often fail.
- User Study – In a blind preference test with 50 participants, 68 % preferred the JUST‑DUB‑IT output over the strongest baseline, citing “more natural lip movement” and “clearer voice.”
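As a rough illustration of how such numbers could be produced, the sketch below computes per‑frame SSIM and an identity‑similarity score. `embed_face` is a hypothetical stand‑in for a pretrained face‑recognition encoder, and lip‑sync metrics such as LSE‑C would additionally require a SyncNet‑style audio‑visual model not shown here.

```python
# Minimal evaluation sketch (not the paper's code): frame-level SSIM and a
# cosine identity similarity between real and dubbed frames.
import numpy as np
from skimage.metrics import structural_similarity


def embed_face(frame: np.ndarray) -> np.ndarray:
    # Placeholder: a real implementation would crop the face and run a
    # pretrained recognition network; here we return a deterministic dummy vector.
    rng = np.random.default_rng(int(frame.mean() * 1000) % (2**32))
    return rng.standard_normal(512)


def mean_ssim(real: np.ndarray, dubbed: np.ndarray) -> float:
    """Average SSIM over aligned uint8 frame stacks of shape (T, H, W, 3)."""
    scores = [structural_similarity(r, d, channel_axis=-1, data_range=255)
              for r, d in zip(real, dubbed)]
    return float(np.mean(scores))


def identity_similarity(real: np.ndarray, dubbed: np.ndarray) -> float:
    """Mean cosine similarity between face embeddings of corresponding frames."""
    sims = []
    for r, d in zip(real, dubbed):
        a, b = embed_face(r), embed_face(d)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))


real = np.random.randint(0, 256, (8, 64, 64, 3), np.uint8)
dubbed = real.copy()
print(mean_ssim(real, dubbed), identity_similarity(real, dubbed))
```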
Practical Implications
- Content Localization – Media companies can automate dubbing for global releases, cutting down on expensive studio re‑recording and manual lip‑sync work.
- Live Translation – The single‑pass architecture is fast enough (≈2× real‑time on a single GPU) to be integrated into live‑streaming platforms for on‑the‑fly multilingual broadcasts.
- AR/VR Avatars – Real‑time avatar dubbing for virtual meetings or games becomes feasible, as the model can preserve a user’s facial identity while speaking a different language.
- Accessibility – Enables rapid creation of sign‑language‑augmented videos in which the spoken track is translated while the speaker’s mouth movements remain intelligible to lip‑readers.
- Tooling Simplicity – Developers no longer need to stitch together separate speech‑synthesis, lip‑sync, and video‑editing modules; a single API call can handle the whole pipeline, as sketched below.
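Purely to illustrate that single‑call shape, the sketch below defines a hypothetical `dub` entry point; neither the function nor its parameters correspond to a published API.

```python
# Hypothetical interface sketch: `dub` and `DubRequest` are illustrative names,
# not a real package; the body is a stub that does not invoke any model.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DubRequest:
    video: Path          # source clip with the original speaker
    transcript: str      # target-language transcript to be spoken
    target_lang: str     # e.g. "es", "de", "ja"


def dub(request: DubRequest, output: Path) -> Path:
    # A real system would run the LoRA-augmented diffusion model once here,
    # producing a new audio track and synchronized face frames in one pass.
    return output        # placeholder: nothing is generated in this sketch


print(dub(DubRequest(Path("talk_en.mp4"), "Hola a todos", "es"), Path("talk_es.mp4")))
```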
Limitations & Future Work
- Synthetic Training Gap – Although the model is trained on self‑generated multilingual pairs, subtle domain shifts may appear when dubbing languages with vastly different phonetic mouth shapes (e.g., Mandarin vs. English).
- Resource Requirements – The underlying diffusion model still demands a high‑end GPU for real‑time inference; lighter variants are needed for edge devices.
- Multi‑Speaker Scenarios – Current experiments focus on single‑speaker clips; extending the approach to dialogues with multiple interacting faces remains an open challenge.
- Fine‑Grained Control – The system does not yet expose knobs for adjusting emotional tone or speaking style in the dubbed audio, which could be valuable for creative applications.
Future work will explore domain‑adaptive fine‑tuning on real multilingual dubbing data, model compression techniques for on‑device deployment, and extensions to handle multi‑person scenes and expressive speech control.
Authors
- Anthony Chen
- Naomi Ken Korem
- Tavi Halperin
- Matan Ben Yosef
- Urska Jelercic
- Ofir Bibi
- Or Patashnik
- Daniel Cohen‑Or
Paper Information
- arXiv ID: 2601.22143v1
- Categories: cs.GR, cs.CV
- Published: January 29, 2026