[Paper] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Source: arXiv - 2512.07831v1
Overview
UnityVideo tackles a core shortcoming of current video‑generation models: they are usually conditioned on a single modality (e.g., text or a single visual cue), which limits their ability to understand and respect the physical world. By jointly learning from segmentation masks, human skeletons, DensePose, optical flow, and depth maps, the authors present a unified, “world‑aware” video generator that produces more coherent, physically plausible footage and generalizes better to unseen scenarios.
Key Contributions
- Unified multi‑modal framework that simultaneously ingests five complementary visual modalities during training.
- Dynamic noising scheme that harmonizes heterogeneous training objectives (diffusion, reconstruction, etc.) into a single optimization pipeline.
- Modality switcher + in‑context learner: a lightweight controller that dynamically re‑configures the backbone for each modality without duplicating parameters.
- Large‑scale unified dataset (≈1.3 M video clips with aligned multimodal annotations) released to the community.
- Empirical gains: faster convergence, higher video fidelity, stronger temporal consistency, and markedly better zero‑shot performance on out‑of‑distribution videos.
Methodology
- Data unification – The authors assemble a massive corpus in which every video frame is paired with segmentation, pose, DensePose, optical-flow, and depth maps, creating a “multimodal canvas” the model can attend to (a sketch of what one aligned record might look like appears after this list).
- Dynamic noising – Instead of training separate diffusion processes for each modality, noise is injected in a modality-aware manner, so a single denoising network learns to reconstruct any of the five signals from its noisy version (see the training-step sketch after this list).
- Modality switcher – A small gating module receives a one-hot modality token and produces a set of scaling vectors (similar to FiLM layers). These vectors modulate the main transformer/UNet backbone, effectively turning the same network into a specialist for the requested modality (sketched below).
- In‑context learner – During inference, a short “prompt” consisting of a few example frames (and their modalities) is fed to the switcher, enabling the model to adapt its generation style on the fly (e.g., prioritize depth consistency for a driving scene).
- Joint optimization – All modalities share the same loss backbone: a combination of diffusion reconstruction loss, perceptual loss, and motion-consistency loss (a toy combination is sketched after this list). The unified objective forces the network to learn cross-modal correlations (e.g., depth ↔ optical flow) that improve world reasoning.
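To make the “multimodal canvas” concrete, here is a minimal sketch of what one aligned training record could look like. The field names, shapes, and the presence of a caption are assumptions for illustration, not the released dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalClip:
    """One training record: a video clip plus frame-aligned annotations.

    Shapes are illustrative assumptions (T frames, H x W resolution,
    J skeleton joints); the released dataset's schema may differ.
    """
    rgb: np.ndarray           # (T, H, W, 3) uint8 video frames
    segmentation: np.ndarray  # (T, H, W) integer class masks
    skeleton: np.ndarray      # (T, J, 2) 2D joint coordinates per frame
    densepose: np.ndarray     # (T, H, W, 3) IUV-style body-surface maps
    optical_flow: np.ndarray  # (T-1, H, W, 2) frame-to-frame flow fields
    depth: np.ndarray         # (T, H, W) per-frame depth maps
    caption: str              # text prompt describing the clip (assumed)
```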
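The dynamic-noising idea can be pictured as a single DDPM-style training step in which each sample carries one modality and one shared denoiser predicts the injected noise. This is a minimal sketch under standard diffusion assumptions: the function name, `denoiser`'s interface, and the uniform timestep draw are hypothetical, and the paper's actual modality-aware noise scheduling is not reproduced here.

```python
import torch
import torch.nn.functional as F

def modality_aware_diffusion_step(denoiser, latents, modality_ids, alphas_cumprod):
    """One joint training step over mixed-modality samples (illustrative sketch).

    latents:        (B, C, T, H, W) encoded clips; each sample carries ONE of
                    the five modalities (masks, pose, DensePose, flow, depth).
    modality_ids:   (B,) integer id of the modality carried by each sample.
    alphas_cumprod: (num_steps,) cumulative DDPM noise schedule.
    """
    batch = latents.shape[0]
    num_steps = alphas_cumprod.shape[0]

    # Sample a noise level per clip. The paper's modality-specific scheduling
    # is not reproduced here; a plain uniform draw stands in for it.
    t = torch.randint(0, num_steps, (batch,), device=latents.device)

    # Standard forward diffusion: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps.
    a_t = alphas_cumprod[t].view(batch, 1, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise

    # A single shared denoiser handles every modality; the modality id is
    # passed as extra conditioning so the backbone can specialize per signal.
    pred = denoiser(noisy, t, modality_ids)
    return F.mse_loss(pred, noise)
```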
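Below is a compact FiLM-style gate in the spirit of the described modality switcher: a modality token is mapped to per-channel scale and shift vectors that modulate backbone activations, so no backbone weights are duplicated. The class name, layer sizes, and the integer-id interface (standing in for a one-hot token) are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ModalitySwitcher(nn.Module):
    """FiLM-like controller: maps a modality id to per-channel (scale, shift)."""

    def __init__(self, num_modalities: int, num_channels: int, hidden: int = 256):
        super().__init__()
        # An integer id + embedding is equivalent to a one-hot token times a matrix.
        self.embed = nn.Embedding(num_modalities, hidden)
        self.to_scale_shift = nn.Linear(hidden, 2 * num_channels)

    def forward(self, features: torch.Tensor, modality_ids: torch.Tensor):
        # features:     (B, C, ...) activations from the shared backbone
        # modality_ids: (B,) which of the five signals this sample carries
        scale, shift = self.to_scale_shift(self.embed(modality_ids)).chunk(2, dim=-1)
        # Broadcast (B, C) over any remaining temporal/spatial dimensions.
        extra_dims = features.dim() - 2
        view = (features.shape[0], features.shape[1]) + (1,) * extra_dims
        return features * (1 + scale.view(view)) + shift.view(view)


# Usage: modulate a backbone block's output for one modality (id 4, arbitrary).
switcher = ModalitySwitcher(num_modalities=5, num_channels=64)
feats = torch.randn(2, 64, 8, 16, 16)          # (B, C, T, H, W)
out = switcher(feats, torch.tensor([4, 4]))
```

The in-context learner described above would presumably reuse this same conditioning path, feeding a few example frames and their modality tokens before generation; the exact prompting mechanics are not spelled out in this sketch.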
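One way the unified objective could combine the three loss families named above is sketched next. The weights, the frame-difference stand-in for motion consistency, and the frozen feature extractor are placeholders, not values or components taken from the paper.

```python
import torch
import torch.nn.functional as F

def unified_loss(pred_noise, true_noise, pred_frames, real_frames,
                 perceptual_net, w_perc=0.1, w_motion=0.05):
    """Weighted sum of diffusion, perceptual, and motion-consistency terms.

    pred_frames / real_frames: (B, T, C, H, W) decoded clips.
    Weights and the feature extractor are illustrative assumptions; the paper
    only states that these three loss families are combined.
    """
    # 1) Diffusion reconstruction loss (noise prediction).
    diffusion = F.mse_loss(pred_noise, true_noise)

    # 2) Perceptual loss: feature-space distance with any frozen image
    #    feature extractor, applied frame by frame.
    perceptual = F.l1_loss(perceptual_net(pred_frames.flatten(0, 1)),
                           perceptual_net(real_frames.flatten(0, 1)))

    # 3) Motion consistency: match frame-to-frame differences, a cheap
    #    stand-in for flow-based temporal terms.
    motion = F.l1_loss(pred_frames[:, 1:] - pred_frames[:, :-1],
                       real_frames[:, 1:] - real_frames[:, :-1])

    return diffusion + w_perc * perceptual + w_motion * motion
```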
Results & Findings
| Metric | Baseline (single-modality) | UnityVideo |
|---|---|---|
| FVD (Fréchet Video Distance, lower is better) | 210 | 138 |
| Temporal Consistency (TC-Score, higher is better) | 0.71 | 0.84 |
| Zero-shot generalization on an unseen domain (higher is better) | 0.62 | 0.78 |
| Convergence epochs to 90 % of final quality (lower is better) | 300 | 180 |
- Visual quality: Samples show sharper textures, fewer flickering artifacts, and more accurate human motion (skeletons stay aligned with generated bodies).
- Physical plausibility: Depth‑aware generation respects occlusions; optical‑flow consistency reduces “ghosting” when objects move fast.
- Zero‑shot robustness: When evaluated on a completely new dataset (e.g., underwater footage), UnityVideo maintains higher fidelity than text‑only diffusion models, confirming that multimodal grounding yields better world models.
Practical Implications
- Game & VR content pipelines – Developers can generate cutscenes or background loops that automatically respect scene geometry and character rigs, cutting down manual key‑framing.
- Synthetic data for perception – Autonomous‑driving stacks need aligned video, depth, and flow; UnityVideo can produce unlimited, physically consistent training data, accelerating simulation‑to‑real transfer.
- Rapid prototyping of visual effects – VFX artists can supply a rough pose or segmentation mask and let the model fill in realistic motion and lighting, dramatically shortening iteration cycles.
- Cross‑modal editing tools – Because the same backbone can be switched on‑the‑fly, a UI could let users toggle between “edit depth,” “adjust pose,” or “refine flow” without re‑training separate models.
Limitations & Future Work
- Computational cost – Joint training on 1.3 M multimodal clips still requires multi-GPU clusters, and inference latency is higher than that of single-modality diffusion models.
- Modal coverage – The current set excludes audio, text, or high‑level scene graphs, which could further enrich world understanding.
- Domain bias – The dataset leans heavily toward indoor and urban scenes; performance on highly stochastic domains (e.g., crowds, fluid simulations) remains to be explored.
Future directions include extending the modality switcher to handle audio‑visual cues, optimizing the architecture for real‑time generation, and investigating self‑supervised modality discovery to reduce reliance on costly annotations.
UnityVideo demonstrates that a truly “world‑aware” video generator is within reach when we let multiple visual signals speak to each other. For developers eager to harness AI‑generated motion that respects physics and geometry, the released code and dataset provide a solid foundation to build next‑generation content creation tools.
Authors
- Jiehui Huang
- Yuechen Zhang
- Xu He
- Yuan Gao
- Zhi Cen
- Bin Xia
- Yan Zhou
- Xin Tao
- Pengfei Wan
- Jiaya Jia
Paper Information
- arXiv ID: 2512.07831v1
- Categories: cs.CV
- Published: December 8, 2025
- PDF: https://arxiv.org/pdf/2512.07831v1