[Paper] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Published: April 21, 2026
4 min read
Source: arXiv (2604.19720v1)

Overview

The paper “ReImagine: Rethinking Controllable High‑Quality Human Video Generation via Image‑First Synthesis” proposes a new way to generate realistic human videos that can be steered by pose and camera viewpoint. By first creating a high‑fidelity static image of the person and then turning that image into a video, the authors achieve both visual quality and temporal consistency—two aspects that have traditionally been at odds in prior work.

Key Contributions

  • Image‑first generation pipeline – separates appearance learning (via a pretrained image model) from temporal dynamics, allowing each to be optimized independently.
  • Pose‑ and viewpoint‑controllable synthesis – integrates SMPL‑X body models to guide motion and camera changes, giving users fine‑grained control over the output.
  • Training‑free temporal refinement – leverages an off‑the‑shelf video diffusion model to smooth out frame‑to‑frame artifacts without additional training.
  • Canonical human dataset & compositional image model – releases a curated dataset of neutral‑pose humans and a lightweight model for mixing body parts, textures, and backgrounds.
  • Open‑source implementation – code, pretrained weights, and data are publicly available, facilitating reproducibility and downstream research.

Methodology

  1. Static Image Generation

    • A pretrained high‑resolution image diffusion model (e.g., Stable Diffusion) is conditioned on a canonical human description and a target pose rendered from the SMPL‑X mesh.
    • This step focuses solely on producing a photorealistic appearance (clothing, hair, skin) without worrying about motion.
  2. Pose & Viewpoint Conditioning

    • The SMPL‑X model supplies 3D joint locations and camera parameters for each desired frame.
    • These parameters are encoded and fed to the image generator as additional conditioning tokens, ensuring the rendered image matches the intended pose and viewpoint.
  3. Temporal Upscaling via Video Diffusion

    • The sequence of generated images is passed through a pretrained video diffusion model (e.g., Video Diffusion Models) that operates without any fine‑tuning.
    • This model refines inter‑frame consistency, corrects flickering, and adds subtle motion cues (e.g., cloth dynamics) while preserving the high‑quality appearance from step 1.
  4. Compositional Human Synthesis (Auxiliary Model)

    • An auxiliary network learns to blend separate components (body shape, clothing, background) in a canonical space, making it easy to swap outfits or environments for downstream applications.
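
To make step 2 concrete, pose and camera parameters have to be packed into a fixed-size form the image generator can consume as conditioning tokens. The sketch below is only an illustration of that idea: the token layout, dimensions, and the 7-DoF camera encoding are assumptions, not the paper's actual scheme (SMPL-X does define 55 body joints, which is the one detail taken from the model itself).

```python
import numpy as np

def encode_conditioning(joints_3d: np.ndarray, camera: np.ndarray,
                        token_dim: int = 64) -> np.ndarray:
    """Pack SMPL-X joint positions and camera parameters into fixed-size
    conditioning tokens (a hypothetical layout, for illustration only).

    joints_3d: (J, 3) array of 3D joint locations.
    camera:    (7,) array, e.g. position (3) + orientation quaternion (4).
    Returns:   (n_tokens, token_dim) array, zero-padded to a token boundary.
    """
    flat = np.concatenate([joints_3d.ravel(), camera.ravel()])
    n_tokens = int(np.ceil(flat.size / token_dim))
    tokens = np.zeros((n_tokens, token_dim), dtype=flat.dtype)
    tokens.reshape(-1)[: flat.size] = flat  # fill row-major, pad the tail
    return tokens

# Example: 55 SMPL-X joints plus a 7-DoF camera -> 172 values in 3 tokens.
tokens = encode_conditioning(np.zeros((55, 3)), np.zeros(7))
print(tokens.shape)  # (3, 64)
```

A real system would likely learn this projection (e.g., with an MLP) rather than flatten raw coordinates, but the interface, structured 3D parameters in, a token sequence out, is the relevant point.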

The overall pipeline is modular: any state‑of‑the‑art image generator or video diffusion model can be swapped in, making the approach future‑proof.
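
That modularity can be sketched as two swappable stages behind a common interface. Everything below (function names, array shapes, the moving-average stand-in for the video diffusion refiner) is illustrative, not the authors' code:

```python
import numpy as np
from typing import Callable, List

# Stage 1: any image generator mapping a conditioning vector to an RGB frame.
# Stage 2: any video refiner mapping a frame stack to a smoothed frame stack.
ImageGen = Callable[[np.ndarray], np.ndarray]
VideoRefiner = Callable[[np.ndarray], np.ndarray]

def generate_video(conds: List[np.ndarray], image_gen: ImageGen,
                   refiner: VideoRefiner) -> np.ndarray:
    """Image-first pipeline: per-frame synthesis, then one training-free
    temporal refinement pass over the stacked frames."""
    frames = np.stack([image_gen(c) for c in conds])  # (T, H, W, 3)
    return refiner(frames)                            # (T, H, W, 3)

# Dummy stand-ins: a noise "generator" and a temporal moving average
# playing the role of the video diffusion refiner.
rng = np.random.default_rng(0)
dummy_gen = lambda cond: rng.random((64, 64, 3))

def moving_average(frames: np.ndarray, k: int = 3) -> np.ndarray:
    pad = np.pad(frames, ((k // 2, k // 2), (0, 0), (0, 0), (0, 0)),
                 mode="edge")
    return np.stack([pad[t : t + k].mean(axis=0)
                     for t in range(frames.shape[0])])

video = generate_video([np.zeros(8)] * 5, dummy_gen, moving_average)
print(video.shape)  # (5, 64, 64, 3)
```

Because each stage only sees arrays at its boundary, either callable could be replaced by a stronger image or video diffusion model without touching the other, which is the "future-proof" property the paragraph above describes.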

Results & Findings

  • Visual Quality – The generated videos achieve FID scores of roughly 30, comparable to real video clips, while maintaining near‑4K resolution, a notable jump from prior methods that often cap out at 256–512 px.
  • Temporal Consistency – Measured by the Temporal Warping Error (TWE), the approach reduces flicker by ~45 % relative to baseline video‑GANs.
  • Control Fidelity – Ablation studies show that pose errors stay under 5 mm (in 3D space) and viewpoint deviations under 2°, confirming precise controllability.
  • User Study – In a blind test with 50 developers, 78 % preferred ReImagine videos over competing systems for realism and smoothness.
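
The exact Temporal Warping Error formulation is not reproduced in this summary; in the literature it typically warps the previous frame toward the current one via optical flow and measures the residual. A toy stand-in, using a known uniform integer flow instead of estimated optical flow, conveys the idea:

```python
import numpy as np

def warping_error(frames: np.ndarray, flow: tuple) -> float:
    """Mean absolute error between each frame and its predecessor warped by
    a (dy, dx) integer flow -- a simplified proxy for flow-based TWE."""
    dy, dx = flow
    errs = []
    for t in range(1, frames.shape[0]):
        warped = np.roll(frames[t - 1], shift=(dy, dx), axis=(0, 1))
        errs.append(np.abs(frames[t] - warped).mean())
    return float(np.mean(errs))

# A gradient pattern translating 2 px per frame scores zero under the
# correct flow and a larger error under the wrong (zero) flow.
base = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
clip = np.stack([np.roll(base, 2 * t, axis=1) for t in range(4)])
print(warping_error(clip, (0, 2)) < warping_error(clip, (0, 0)))  # True
```

A real TWE implementation would estimate per-pixel flow and mask occlusions; the point here is just that lower warped residuals mean less flicker, which is the axis on which the reported ~45 % improvement is measured.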

Practical Implications

  • Virtual Production & Gaming – Studios can generate high‑quality character animations on the fly, reducing the need for costly motion‑capture sessions.
  • AR/VR Avatars – Real‑time pose updates (e.g., from a webcam) can be fed into the pipeline to render lifelike avatars that maintain visual fidelity across head‑mounted displays.
  • E‑commerce & Fashion – Brands can showcase garments on a virtual model from any angle or pose without filming multiple takes, accelerating catalog creation.
  • Content Creation Tools – Plug‑ins for Unity/Unreal or video‑editing suites could expose “pose‑to‑video” controls, empowering creators without deep ML expertise.
  • Research Acceleration – The released canonical dataset and compositional model provide a solid baseline for further work on controllable human synthesis, domain adaptation, or personalized avatar generation.

Limitations & Future Work

  • Dependence on SMPL‑X Accuracy – Errors in the underlying 3D mesh (e.g., for loose clothing or accessories) propagate to the final video, limiting fidelity for highly non‑rigid outfits.
  • Computational Cost – Running two diffusion models sequentially (image then video) is still GPU‑intensive; real‑time deployment will require model distillation or lighter alternatives.
  • Limited Multi‑Person Scenarios – The current pipeline focuses on a single subject; extending to interactions or crowd scenes remains an open challenge.
  • Future Directions – The authors suggest integrating physics‑based cloth simulators, exploring low‑latency diffusion variants, and expanding the dataset to cover diverse body types and cultural attire.

Authors

  • Zhengwentai Sun
  • Keru Zheng
  • Chenghong Li
  • Hongjie Liao
  • Xihe Yang
  • Heyuan Li
  • Yihao Zhi
  • Shuliang Ning
  • Shuguang Cui
  • Xiaoguang Han

Paper Information

  • arXiv ID: 2604.19720v1
  • Categories: cs.CV
  • Published: April 21, 2026