[Paper] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Source: arXiv - 2602.12160v1
Overview
DreamID‑Omni is a new “one‑stop‑shop” framework that lets you generate or edit human‑centric audio‑video content with fine‑grained control over who appears on screen and what they sound like. By unifying three previously separate tasks—reference‑based audio‑video generation, video‑to‑audio editing, and audio‑driven animation—DreamID‑Omni pushes the boundary of what can be done with foundation models in a single, developer‑friendly pipeline.
Key Contributions
- Unified architecture that handles reference‑based generation, video editing, and audio‑driven animation under one model.
- Symmetric Conditional Diffusion Transformer (SCDT) – a diffusion‑based transformer that injects heterogeneous conditioning signals (e.g., face images, voice clips, textual captions) symmetrically, preserving consistency across modalities.
- Dual‑Level Disentanglement:
  - Signal‑level: Synchronized Rotary Positional Encoding (RoPE) tightly binds each identity/timbre to its attention space, preventing speaker confusion.
  - Semantic‑level: Structured captions explicitly map attributes (e.g., “John’s deep voice”) to subjects, enabling multi‑person control.
- Multi‑Task Progressive Training – a curriculum that starts with weakly‑constrained generative priors and gradually introduces strongly‑constrained tasks, avoiding over‑fitting and harmonizing disparate objectives.
- State‑of‑the‑art performance on video quality, audio fidelity, and cross‑modal consistency, surpassing several commercial solutions.
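The structured‑caption idea can be illustrated with a toy parser that splits a caption into explicit subject‑attribute pairs. This is a hedged sketch, not the paper's released code: the caption grammar, the function name, and the output format are all assumptions made for illustration.

```python
import re

def parse_structured_caption(caption):
    """Split a caption into explicit (subject, attribute) pairs.

    Toy grammar (an assumption, not the paper's actual format): each
    clause looks like "<Subject>'s <attribute>" or "<Subject> - <attribute>",
    and clauses are separated by commas or the word "and".
    """
    pairs = []
    for clause in re.split(r",|\band\b", caption):
        clause = clause.strip()
        m = re.match(r"(\w+)'s (.+)", clause)       # possessive form
        if not m:
            m = re.match(r"(\w+)\s+-\s+(.+)", clause)  # dash form
        if m:
            pairs.append((m.group(1), m.group(2).strip()))
    return pairs

print(parse_structured_caption("John's deep voice and Alice - high-pitched voice"))
# -> [('John', 'deep voice'), ('Alice', 'high-pitched voice')]
```

In the real system these pairs would be embedded as explicit conditioning tokens rather than returned as Python tuples.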
Methodology
- Diffusion Backbone – The core is a latent diffusion model that iteratively denoises a joint audio‑video latent representation.
- Symmetric Conditional Injection – Conditioning data (face images, voice waveforms, textual descriptors) are embedded and injected into both the encoder and decoder sides of the transformer, ensuring that each modality influences generation symmetrically.
- Dual‑Level Disentanglement
  - Synchronized RoPE: Each conditioning token receives a rotary positional encoding that is synchronized across the audio and visual streams, locking a specific identity/timbre to a fixed attention sub‑space.
  - Structured Captions: During training, captions are parsed into “subject‑attribute” pairs (e.g., “Alice – high‑pitched voice”), which are fed as explicit tokens so the model learns a deterministic mapping.
- Progressive Multi‑Task Training – The model is first trained on loosely constrained tasks (e.g., unconditional video synthesis) to learn generic priors, then gradually fine‑tuned on tightly constrained tasks (e.g., reference‑based audio‑video generation, R2AV, with exact identity‑voice pairing). This curriculum mitigates catastrophic forgetting and balances the loss terms across tasks.
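The signal‑level binding can be sketched in a few lines: if the audio token and the video token of one identity receive the same rotary phase, their relative rotation is zero and they stay locked together under attention. The following is a minimal 1‑D RoPE illustration with assumed shapes and names, not the paper's actual implementation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply 1-D rotary positional encoding to x of shape (n_tokens, dim).

    Each token is rotated by angles determined by its position, so two
    tokens given the same position land in the same rotary phase.
    """
    n, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # (half,) rotation speeds
    angles = positions[:, None] * freqs[None, :]    # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
dim = 8
video_tokens = rng.normal(size=(2, dim))  # identities 0 and 1, visual stream
audio_tokens = rng.normal(size=(2, dim))  # identities 0 and 1, audio stream

# Synchronization: identity 0's audio and video tokens share position 0,
# identity 1's share position 1, binding each timbre to one face.
identity_pos = np.array([0.0, 1.0])
v = rope(video_tokens, identity_pos)
a = rope(audio_tokens, identity_pos)
```

Because the rotation is norm‑preserving, synchronizing positions changes only the phase of each identity's tokens, which is what lets attention separate speakers without altering token content.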
Results & Findings
| Metric | DreamID‑Omni | Prior SOTA (Academic) | Leading Commercial Model |
|---|---|---|---|
| Video FID ↓ | 12.3 | 18.7 | 15.4 |
| Audio PESQ ↑ | 4.2 | 3.6 | 3.9 |
| Audio‑Visual Sync (AV‑Sync) ↑ | 0.92 | 0.78 | 0.84 |
| Multi‑Person Identity Accuracy ↑ | 94.1 % | 81.3 % | 86.7 % |
- DreamID‑Omni consistently outperforms both academic baselines and a top‑tier commercial API on all three fronts: visual realism, audio quality, and cross‑modal alignment.
- In multi‑speaker scenarios (e.g., a dialogue between two characters), the Dual‑Level Disentanglement reduces speaker‑swapping errors from over 15 % to under 6 %.
- Ablation studies confirm that removing either the symmetric conditional injection or the progressive training pipeline drops performance by 7–10 % across metrics.
Practical Implications
- Content Creation Platforms – Integrate DreamID‑Omni to let creators generate synthetic interview clips, dubbing, or virtual avatars with precise control over who says what, without manual lip‑sync editing.
- Game & VR Development – Use the model for on‑the‑fly character animation driven by voice actors, enabling dynamic NPC dialogues that stay visually consistent with the speaker’s identity.
- Accessibility Tools – Generate sign‑language videos where the signer’s face and voice are matched to the spoken content, improving inclusivity for deaf users.
- Rapid Prototyping – Developers can feed a single reference image and a voice clip to produce a full‑length video, cutting down production time for marketing or training videos.
- API‑Ready Service – The authors plan to open‑source the code, making it straightforward to wrap the model behind a REST endpoint, similar to existing text‑to‑image APIs.
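As a sketch of that API‑ready usage, the fragment below wraps a placeholder generation function behind a minimal JSON‑over‑HTTP endpoint using only the Python standard library. Since the code is not yet released, `generate_video`, its parameters, and the request schema are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_video(face_image_b64, voice_clip_b64, caption):
    """Hypothetical stand-in for the real model call; DreamID-Omni's
    actual API is not public, so this just returns a fake result."""
    return {"status": "ok", "video_url": "/results/demo.mp4"}

class GenerateHandler(BaseHTTPRequestHandler):
    """Accepts a POST with JSON body {"face": ..., "voice": ..., "caption": ...}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = generate_video(payload.get("face"),
                                payload.get("voice"),
                                payload.get("caption", ""))
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve requests (blocking call), one would run:
# HTTPServer(("localhost", 8000), GenerateHandler).serve_forever()
```

In production the handler would stream or store the rendered video and return a job ID instead of a synchronous result.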
Limitations & Future Work
- Compute‑Intensive – Training and inference still require high‑end GPUs (≥ 40 GB VRAM) due to the diffusion transformer’s size, limiting on‑device deployment.
- Domain Generalization – The model performs best on well‑lit, frontal faces; extreme poses, occlusions, or low‑quality audio still degrade results.
- Ethical Safeguards – While the paper discusses misuse mitigation, concrete detection tools for synthetic media are not integrated.
- Future Directions suggested by the authors include:
  - Lightweight distillation for edge‑device inference.
  - Extending the conditioning to full‑body motion and expressive gestures.
  - Incorporating explicit bias‑control mechanisms to prevent unintended demographic stereotypes.
DreamID‑Omni bridges the gap between research‑grade multimodal generation and production‑ready tools, offering developers a powerful new lever for creating controllable, high‑fidelity human‑centric audio‑video content.
Authors
- Xu Guo
- Fulong Ye
- Qichao Sun
- Liyang Chen
- Bingchuan Li
- Pengze Zhang
- Jiawei Liu
- Songtao Zhao
- Qian He
- Xiangwang Hou
Paper Information
- arXiv ID: 2602.12160v1
- Categories: cs.CV
- Published: February 12, 2026