[Paper] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Source: arXiv - 2604.19720v1
Overview
The paper “ReImagine: Rethinking Controllable High‑Quality Human Video Generation via Image‑First Synthesis” proposes a new way to generate realistic human videos that can be steered by pose and camera viewpoint. By first creating a high‑fidelity static image of the person and then turning that image into a video, the authors achieve both visual quality and temporal consistency—two aspects that have traditionally been at odds in prior work.
Key Contributions
- Image‑first generation pipeline – separates appearance learning (via a pretrained image model) from temporal dynamics, allowing each to be optimized independently.
- Pose‑ and viewpoint‑controllable synthesis – integrates SMPL‑X body models to guide motion and camera changes, giving users fine‑grained control over the output.
- Training‑free temporal refinement – leverages an off‑the‑shelf video diffusion model to smooth out frame‑to‑frame artifacts without additional training.
- Canonical human dataset & compositional image model – releases a curated dataset of neutral‑pose humans and a lightweight model for mixing body parts, textures, and backgrounds.
- Open‑source implementation – code, pretrained weights, and data are publicly available, facilitating reproducibility and downstream research.
Methodology
1. Static Image Generation
- A pretrained high‑resolution image diffusion model (e.g., Stable Diffusion) is conditioned on a canonical human description and a target pose rendered from the SMPL‑X mesh.
- This step focuses solely on producing a photorealistic appearance (clothing, hair, skin) without worrying about motion.
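To make the pose-conditioning input concrete, here is a minimal numpy sketch of how 3D joints (e.g., from an SMPL‑X fit) could be projected through a pinhole camera and rasterized into a 2D pose map that an image generator can be conditioned on. The function names, intrinsics, and square-marker rasterization are illustrative assumptions, not the paper's implementation, which renders from the full SMPL‑X mesh.

```python
import numpy as np

def project_joints(joints_3d, K):
    """Project 3D joints (N, 3) to pixel coordinates with a pinhole camera.

    `K` is a 3x3 intrinsics matrix; joints are assumed to be in camera
    coordinates with positive depth (z > 0).
    """
    uvw = joints_3d @ K.T            # (N, 3): homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]    # perspective divide
    return uv

def rasterize_pose_map(uv, image_size=(256, 256), radius=2):
    """Draw each projected joint as a small square on a blank pose map."""
    h, w = image_size
    pose_map = np.zeros((h, w), dtype=np.float32)
    for x, y in uv:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h:
            pose_map[max(0, yi - radius):yi + radius + 1,
                     max(0, xi - radius):xi + radius + 1] = 1.0
    return pose_map

# Toy example: three joints in front of a simple camera.
K = np.array([[200.0, 0.0, 128.0],
              [0.0, 200.0, 128.0],
              [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 2.0],     # projects to the image center
                   [0.2, -0.1, 2.0],
                   [-0.2, 0.3, 2.0]])
uv = project_joints(joints, K)
pose_map = rasterize_pose_map(uv)
```

The resulting pose map plays the role of the control image; a real pipeline would rasterize the full body mesh rather than isolated joints.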
2. Pose & Viewpoint Conditioning
- The SMPL‑X model supplies 3D joint locations and camera parameters for each desired frame.
- These parameters are encoded and fed to the image generator as additional conditioning tokens, ensuring the rendered image matches the intended pose and viewpoint.
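One common way to turn continuous pose and camera parameters into conditioning tokens is a sinusoidal (Fourier‑feature) encoding. The sketch below is a hypothetical stand‑in for whatever encoder the paper actually uses: the per‑joint token layout, the camera parameterization, and the frequency count are all assumptions.

```python
import numpy as np

def fourier_encode(x, num_freqs=4):
    """Encode scalars with sin/cos features at octave-spaced frequencies,
    a common way to feed continuous parameters to a conditioning network."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi          # (F,)
    angles = x[..., None] * freqs                        # (..., F)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def build_condition_tokens(joints_3d, cam_params, num_freqs=4):
    """One token per joint plus one camera token (hypothetical layout)."""
    joint_tokens = fourier_encode(joints_3d, num_freqs).reshape(
        joints_3d.shape[0], -1)                          # (J, 3 * 2F)
    cam_token = fourier_encode(cam_params, num_freqs).reshape(1, -1)
    # Pad the camera token to the joint-token width so the rows stack.
    pad = joint_tokens.shape[1] - cam_token.shape[1]
    cam_token = np.pad(cam_token, ((0, 0), (0, max(0, pad))))
    return np.vstack([joint_tokens, cam_token])

joints = np.random.default_rng(0).normal(size=(22, 3))   # SMPL-X-like joint count
cam = np.array([0.5, -0.2, 3.0])                         # e.g. azimuth, elevation, distance
tokens = build_condition_tokens(joints, cam)
```

The token matrix would then be injected into the image generator, e.g. via cross‑attention, alongside the text/description conditioning.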
3. Temporal Upscaling via Video Diffusion
- The sequence of generated images is passed through a pretrained, off‑the‑shelf video diffusion model, used without any fine‑tuning.
- This model refines inter‑frame consistency, corrects flickering, and adds subtle motion cues (e.g., cloth dynamics) while preserving the high‑quality appearance from step 1.
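The refinement stage relies on a pretrained video diffusion model, which cannot be reproduced in a short snippet. The sketch below substitutes a trivial neighbour‑averaging filter purely to illustrate the shape of a training‑free, frame‑sequence‑in, frame‑sequence‑out smoothing pass; it is a stand‑in, not the paper's method.

```python
import numpy as np

def temporal_smooth(frames, alpha=0.5):
    """Training-free flicker-reduction stand-in: blend each interior frame
    toward the average of its temporal neighbours. A real pipeline would
    instead pass the frames through a pretrained video diffusion model.
    """
    frames = np.asarray(frames, dtype=np.float32)        # (T, H, W, C)
    out = frames.copy()
    for t in range(1, len(frames) - 1):
        neighbour_mean = 0.5 * (frames[t - 1] + frames[t + 1])
        out[t] = (1 - alpha) * frames[t] + alpha * neighbour_mean
    return out

# Toy sequence: constant black frames with one flickering bright frame.
T, H, W, C = 5, 8, 8, 3
frames = np.zeros((T, H, W, C), dtype=np.float32)
frames[2] += 1.0                                         # flicker spike
smoothed = temporal_smooth(frames, alpha=0.5)
```

The spike at frame 2 is halved and partially redistributed to its neighbours, which is the qualitative effect (flicker suppression) the diffusion-based refiner achieves with far more fidelity.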
4. Compositional Human Synthesis (Auxiliary Model)
- An auxiliary network learns to blend separate components (body shape, clothing, background) in a canonical space, making it easy to swap outfits or environments for downstream applications.
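The component-blending idea can be illustrated with plain back‑to‑front alpha compositing of per‑part layers in a shared canonical frame. This is a simplified sketch: the paper's auxiliary network learns the blending, whereas here the masks are given by hand.

```python
import numpy as np

def composite_layers(background, layers):
    """Alpha-composite (image, alpha) layers over a background, back to front.

    `background` is (H, W, 3); each layer is a pair of an (H, W, 3) image
    and an (H, W, 1) alpha mask in [0, 1] (e.g. body, clothing, accessories).
    """
    out = background.astype(np.float32).copy()
    for image, alpha in layers:
        out = alpha * image + (1.0 - alpha) * out
    return out

# Toy canonical scene: grey body over black background, white "shirt" on top.
H, W = 4, 4
background = np.zeros((H, W, 3), dtype=np.float32)
body = np.full((H, W, 3), 0.5, dtype=np.float32)
body_mask = np.ones((H, W, 1), dtype=np.float32)
shirt = np.ones((H, W, 3), dtype=np.float32)
shirt_mask = np.zeros((H, W, 1), dtype=np.float32)
shirt_mask[:2] = 1.0                    # shirt covers the top half
result = composite_layers(background, [(body, body_mask), (shirt, shirt_mask)])
```

Swapping an outfit or a background then amounts to replacing one entry in the layer list, which is the downstream flexibility the compositional model is meant to provide.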
The overall pipeline is modular: any state‑of‑the‑art image generator or video diffusion model can be swapped in, making the approach future‑proof.
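That modularity can be sketched as two swappable interfaces, one for per-frame image synthesis and one for temporal refinement. The interface names and stub components below are hypothetical; they only show the plug-and-play structure, with stubs standing in for real diffusion models.

```python
from typing import Protocol
import numpy as np

class ImageGenerator(Protocol):
    def generate(self, description: str, pose_map: np.ndarray) -> np.ndarray: ...

class VideoRefiner(Protocol):
    def refine(self, frames: np.ndarray) -> np.ndarray: ...

def image_first_pipeline(description, pose_maps,
                         image_gen: ImageGenerator,
                         refiner: VideoRefiner) -> np.ndarray:
    """Image-first pipeline skeleton: synthesize one frame per pose map,
    then run a single temporal refinement pass over the stacked frames."""
    frames = np.stack([image_gen.generate(description, p) for p in pose_maps])
    return refiner.refine(frames)

# Stub components standing in for real image/video diffusion models.
class StubImageGen:
    def generate(self, description, pose_map):
        return np.repeat(pose_map[..., None], 3, axis=-1)  # pose map as RGB

class StubRefiner:
    def refine(self, frames):
        return frames  # identity; a real refiner would suppress flicker

poses = [np.zeros((16, 16), dtype=np.float32) for _ in range(4)]
video = image_first_pipeline("a person walking", poses, StubImageGen(), StubRefiner())
```

Because both components are behind interfaces, upgrading to a newer image generator or video diffusion model changes only the objects passed in, not the pipeline itself.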
Results & Findings
- Visual Quality – The generated videos achieve FID scores comparable to those of real video clips (≈ 30) while maintaining near‑4K resolution, a notable jump over prior methods that often cap out at 256–512 px.
- Temporal Consistency – Measured by the Temporal Warping Error (TWE), the approach reduces flicker by ~45 % relative to baseline video‑GANs.
- Control Fidelity – Ablation studies show that pose errors stay under 5 mm (in 3D space) and viewpoint deviations under 2°, confirming precise controllability.
- User Study – In a blind test with 50 developers, 78 % preferred ReImagine videos over competing systems for realism and smoothness.
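To make the flicker metric concrete, here is a simplified warping-error-style computation: warp each frame by a known integer pixel flow and measure the residual against the next frame. Real Temporal Warping Error uses estimated dense optical flow and typically an occlusion mask, so this is an illustrative approximation rather than the paper's exact metric.

```python
import numpy as np

def warping_error(frames, flows):
    """Warping-error-style flicker metric (simplified): warp frame t by an
    integer (dy, dx) flow and take the mean absolute difference against
    frame t+1, averaged over the sequence."""
    errors = []
    for t in range(len(frames) - 1):
        dy, dx = flows[t]
        warped = np.roll(frames[t], shift=(dy, dx), axis=(0, 1))
        errors.append(np.abs(warped - frames[t + 1]).mean())
    return float(np.mean(errors))

# Toy video: a single bright pixel moving one pixel right per frame.
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 4, 1 + t] = 1.0
perfect_flow = [(0, 1)] * (T - 1)       # matches the true motion exactly
zero_flow = [(0, 0)] * (T - 1)          # ignores the motion
err_good = warping_error(frames, perfect_flow)
err_bad = warping_error(frames, zero_flow)
```

A temporally consistent video scores low under the true flow (here exactly zero), while unmodeled motion or flicker shows up as residual error, which is the quantity the reported ~45 % reduction refers to.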
Practical Implications
- Virtual Production & Gaming – Studios can generate high‑quality character animations on‑the‑fly, reducing the need for costly motion‑capture sessions.
- AR/VR Avatars – Real‑time pose updates (e.g., from a webcam) can be fed into the pipeline to render lifelike avatars that maintain visual fidelity across head‑mounted displays.
- E‑commerce & Fashion – Brands can showcase garments on a virtual model from any angle or pose without filming multiple takes, accelerating catalog creation.
- Content Creation Tools – Plug‑ins for Unity/Unreal or video‑editing suites could expose “pose‑to‑video” controls, empowering creators without deep ML expertise.
- Research Acceleration – The released canonical dataset and compositional model provide a solid baseline for further work on controllable human synthesis, domain adaptation, or personalized avatar generation.
Limitations & Future Work
- Dependence on SMPL‑X Accuracy – Errors in the underlying 3D mesh (e.g., for loose clothing or accessories) propagate to the final video, limiting fidelity for highly non‑rigid outfits.
- Computational Cost – Running two diffusion models sequentially (image then video) is still GPU‑intensive; real‑time deployment will require model distillation or lighter alternatives.
- Limited Multi‑Person Scenarios – The current pipeline focuses on a single subject; extending to interactions or crowd scenes remains an open challenge.
- Future Directions – The authors suggest integrating physics‑based cloth simulators, exploring low‑latency diffusion variants, and expanding the dataset to cover diverse body types and cultural attire.
Authors
- Zhengwentai Sun
- Keru Zheng
- Chenghong Li
- Hongjie Liao
- Xihe Yang
- Heyuan Li
- Yihao Zhi
- Shuliang Ning
- Shuguang Cui
- Xiaoguang Han
Paper Information
- arXiv ID: 2604.19720v1
- Categories: cs.CV
- Published: April 21, 2026