[Paper] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Source: arXiv - 2601.10632v1
Overview
The paper introduces CoMoVi, a novel framework that simultaneously generates realistic 3‑D human motion sequences and their corresponding 2‑D video renderings. By tightly coupling a motion diffusion model with a video diffusion model, the authors demonstrate that the two generation tasks can reinforce each other, producing more plausible and temporally consistent results than when the two tasks are tackled separately.
Key Contributions
- Co‑generative diffusion architecture: A dual‑branch diffusion model that jointly denoises 3‑D motion and video frames in a single loop, enabling mutual conditioning.
- 2‑D motion representation for video priors: A compact projection of 3‑D joint trajectories onto the image plane that can be directly consumed by pre‑trained video diffusion models.
- Cross‑modal attention mechanisms: 3‑D‑2‑D cross‑attention layers that let motion features inform video synthesis and vice versa, preserving kinematic consistency.
- CoMoVi Dataset: A curated, large‑scale collection of real‑world human videos annotated with textual descriptions and 3‑D motion capture data, covering a wide variety of actions and environments.
- State‑of‑the‑art results: Empirical evaluation shows superior performance on both motion quality (e.g., lower MPJPE, higher diversity) and video realism (e.g., lower FVD, higher IS) compared with decoupled baselines.
Methodology
Motion Encoding
- Raw 3‑D joint positions are projected onto a 2‑D heat‑map‑like representation (akin to pose skeleton images) that retains spatial relationships while being compatible with image‑based diffusion models.
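The summary does not include code, so the following is a minimal Python sketch of one plausible form of this projection, assuming a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy) and a Gaussian blob rendered at each projected joint.

```python
# Minimal sketch (not the authors' code): project 3-D joints into the image
# plane with a pinhole camera, then rasterize each joint as a Gaussian blob
# so the motion can be consumed by an image-based diffusion model.
import numpy as np

def project_joints(joints_3d, fx=1000.0, fy=1000.0, cx=128.0, cy=128.0):
    """joints_3d: (J, 3) camera-space joint positions (meters); returns (J, 2) pixels."""
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return np.stack([u, v], axis=-1)

def render_heatmaps(joints_2d, size=256, sigma=4.0):
    """Rasterize each projected joint as a (size, size) Gaussian heatmap."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    maps = [np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
            for u, v in joints_2d]
    return np.stack(maps, axis=0)  # (J, size, size)

# Example: 24 SMPL-style joints roughly 3 m in front of the camera.
joints_3d = np.random.randn(24, 3) * 0.3 + np.array([0.0, 0.0, 3.0])
heatmaps = render_heatmaps(project_joints(joints_3d))
print(heatmaps.shape)  # (24, 256, 256)
```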
Dual‑Branch Diffusion
- Two parallel diffusion streams are instantiated: one for the 2‑D motion representation, the other for RGB video frames.
- At each denoising timestep, mutual feature interaction layers exchange latent embeddings between the streams.
- 3‑D‑2‑D cross‑attention modules align motion tokens with video tokens, ensuring that generated pixels follow the underlying skeletal motion.
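The exact layer design is not spelled out here, so the PyTorch sketch below only illustrates the general idea of a bidirectional cross-attention exchange between motion tokens and video tokens; the class name, token layouts, and dimensions are all assumptions.

```python
# Illustrative sketch, not the paper's architecture: one cross-modal block in
# which motion tokens attend to video tokens and vice versa, with residual updates.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.motion_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_from_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, motion_tokens, video_tokens):
        # motion_tokens: (B, T*J, dim); video_tokens: (B, T*H*W, dim)
        m, _ = self.motion_from_video(self.norm_m(motion_tokens),
                                      video_tokens, video_tokens)
        v, _ = self.video_from_motion(self.norm_v(video_tokens),
                                      motion_tokens, motion_tokens)
        return motion_tokens + m, video_tokens + v  # residual exchange

block = CrossModalBlock()
motion = torch.randn(2, 16 * 24, 512)    # 16 frames x 24 joints
video = torch.randn(2, 16 * 8 * 8, 512)  # 16 frames of 8x8 latent patches
motion, video = block(motion, video)
```

A block like this would typically be interleaved with the usual self-attention and feed-forward layers of each diffusion branch at every denoising timestep.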
Training
- The model is trained end‑to‑end on the CoMoVi Dataset using a standard diffusion loss (noise prediction) plus auxiliary consistency losses that penalize mismatches between the reconstructed 3‑D pose (back‑projected from the 2‑D representation) and the ground‑truth motion.
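As a rough illustration of such an objective (the loss weights, function names, and the exact form of the consistency term are assumptions, not the paper's definition):

```python
# Sketch of a combined training objective under the assumptions above:
# standard noise-prediction losses for both branches plus a consistency term.
import torch
import torch.nn.functional as F

def co_generation_loss(pred_noise_motion, noise_motion,
                       pred_noise_video, noise_video,
                       recon_pose_3d, gt_pose_3d,
                       w_consistency=0.1):
    # Epsilon-prediction (noise) losses for the two diffusion branches.
    loss_motion = F.mse_loss(pred_noise_motion, noise_motion)
    loss_video = F.mse_loss(pred_noise_video, noise_video)
    # Consistency term: penalize mismatch between the 3-D pose back-projected
    # from the 2-D motion representation and the ground-truth motion.
    loss_consistency = F.l1_loss(recon_pose_3d, gt_pose_3d)
    return loss_motion + loss_video + w_consistency * loss_consistency
```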
Inference
- Given a textual prompt (or a seed motion), the diffusion process produces a synchronized pair of motion trajectories and video frames in a single forward pass, eliminating the need for post‑hoc retargeting or rendering pipelines.
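A hypothetical joint sampling loop might look like the sketch below; every module name is assumed, and the scheduler is taken to follow a diffusers-style timesteps/step() interface rather than anything specified in the paper.

```python
# Sketch only: denoise the motion and video branches in lock-step so the two
# modalities stay synchronized, instead of generating one and retargeting later.
import torch

@torch.no_grad()
def co_sample(motion_denoiser, video_denoiser, cross_block, text_emb, scheduler,
              motion_shape=(1, 384, 512), video_shape=(1, 1024, 512)):
    motion = torch.randn(motion_shape)  # noisy 2-D motion representation (tokens)
    video = torch.randn(video_shape)    # noisy video latents
    for t in scheduler.timesteps:
        # Exchange features between branches before each shared denoising step.
        motion_c, video_c = cross_block(motion, video)
        eps_m = motion_denoiser(motion_c, t, text_emb)  # predicted noise, motion branch
        eps_v = video_denoiser(video_c, t, text_emb)    # predicted noise, video branch
        motion = scheduler.step(eps_m, t, motion).prev_sample
        video = scheduler.step(eps_v, t, video).prev_sample
    return motion, video
```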
Results & Findings
| Metric | CoMoVi (joint generation) | Decoupled Baseline |
|---|---|---|
| MPJPE (mm) | 28.4 | 35.7 |
| Diversity (Std) | 1.12 | 0.84 |
| FVD (lower is better) | 78.3 | 112.5 |
| IS (higher is better) | 12.6 | 9.4 |
- Higher fidelity: The joint diffusion reduces joint position error by ~20 % compared with a state‑of‑the‑art motion‑only model.
- Better video realism: Fréchet Video Distance improves by roughly 30% (112.5 → 78.3), indicating fewer temporal artifacts and more natural lighting/texture.
- Cross‑modal consistency: Qualitative examples show that limbs never “detach” from the body in the video, a common failure when motion and video are generated separately.
- Generalization: The model successfully handles unseen action categories (e.g., parkour, dance) thanks to the strong priors inherited from the pre‑trained video diffusion backbone.
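For context, MPJPE (mean per-joint position error) is conventionally computed as the average Euclidean distance between predicted and ground-truth 3-D joints; a minimal sketch:

```python
# Standard MPJPE computation over a motion sequence (values in millimeters).
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (T, J, 3) joint positions in millimeters; returns scalar MPJPE."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Example with synthetic data: 60 frames, 24 joints.
gt = np.random.randn(60, 24, 3) * 100.0
pred = gt + np.random.randn(60, 24, 3) * 20.0
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")
```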
Practical Implications
- Game & VR content pipelines: Developers can generate high‑quality character animations and corresponding cut‑scenes on‑the‑fly, reducing reliance on costly motion‑capture sessions.
- Synthetic data for training: CoMoVi can produce paired video‑motion datasets for downstream tasks such as pose estimation, action recognition, or reinforcement‑learning agents that need realistic visual feedback.
- Rapid prototyping for AR/Metaverse: Designers can input a textual description (“a person doing a backflip on a beach”) and instantly obtain a synchronized 3‑D animation and video preview, accelerating concept iteration.
- Film & advertising: Automated generation of crowd or background human actions that stay consistent across camera moves, saving manual rotoscoping and key‑framing effort.
Limitations & Future Work
- Resolution & detail: The current implementation focuses on 256×256 video frames; higher‑resolution outputs would be needed for production‑grade assets.
- Complex interactions: The model handles a single human subject; extending to multi‑person scenes or interactions with objects remains an open challenge.
- Physical plausibility: While kinematic consistency improves, the diffusion process does not enforce dynamics (e.g., ground reaction forces), which can lead to subtle physics violations.
- Dataset bias: The CoMoVi Dataset, though diverse, is still skewed toward outdoor, well‑lit scenarios; future work could incorporate indoor, low‑light, and occluded settings.
Overall, CoMoVi showcases a promising direction where generative video models and 3‑D motion synthesis are no longer isolated modules but collaborative partners, opening new avenues for content creation and synthetic data generation in the developer ecosystem.
Authors
- Chengfeng Zhao
- Jiazhi Shu
- Yubo Zhao
- Tianyu Huang
- Jiahao Lu
- Zekai Gu
- Chengwei Ren
- Zhiyang Dou
- Qing Shuai
- Yuan Liu
Paper Information
- arXiv ID: 2601.10632v1
- Categories: cs.CV
- Published: January 15, 2026