[Paper] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Published: April 21, 2026
4 min read
Source: arXiv (2604.19720v1)

Overview

The paper “ReImagine: Rethinking Controllable High‑Quality Human Video Generation via Image‑First Synthesis” proposes a new way to generate realistic human videos that can be steered by pose and camera viewpoint. By first creating a high‑fidelity static image of the person and then turning that image into a video, the authors achieve both visual quality and temporal consistency—two aspects that have traditionally been at odds in prior work.

Key Contributions

  • Image‑first generation pipeline – separates appearance learning (via a pretrained image model) from temporal dynamics, allowing each to be optimized independently.
  • Pose‑ and viewpoint‑controllable synthesis – integrates SMPL‑X body models to guide motion and camera changes, giving users fine‑grained control over the output.
  • Training‑free temporal refinement – leverages an off‑the‑shelf video diffusion model to smooth out frame‑to‑frame artifacts without additional training.
  • Canonical human dataset & compositional image model – releases a curated dataset of neutral‑pose humans and a lightweight model for mixing body parts, textures, and backgrounds.
  • Open‑source implementation – code, pretrained weights, and data are publicly available, facilitating reproducibility and downstream research.

Methodology

  1. Static Image Generation

    • A pretrained high‑resolution image diffusion model (e.g., Stable Diffusion) is conditioned on a canonical human description and a target pose rendered from the SMPL‑X mesh.
    • This step focuses solely on producing a photorealistic appearance (clothing, hair, skin) without worrying about motion.
  2. Pose & Viewpoint Conditioning

    • The SMPL‑X model supplies 3D joint locations and camera parameters for each desired frame.
    • These parameters are encoded and fed to the image generator as additional conditioning tokens, ensuring the rendered image matches the intended pose and viewpoint.
  3. Temporal Upscaling via Video Diffusion

    • The sequence of generated images is passed through a pretrained video diffusion model (e.g., Video Diffusion Models) that operates without any fine‑tuning.
    • This model refines inter‑frame consistency, corrects flickering, and adds subtle motion cues (e.g., cloth dynamics) while preserving the high‑quality appearance from step 1.
  4. Compositional Human Synthesis (Auxiliary Model)

    • An auxiliary network learns to blend separate components (body shape, clothing, background) in a canonical space, making it easy to swap outfits or environments for downstream applications.
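
To make step 2 concrete, pose and camera parameters have to be packed into a fixed-size form the image generator can consume as conditioning tokens. The sketch below is only an illustration of that idea: the token layout, dimensions, and the 7-DoF camera encoding are assumptions, not the paper's actual scheme (SMPL-X does define 55 body joints, which is the one detail taken from the model itself).

```python
import numpy as np

def encode_conditioning(joints_3d: np.ndarray, camera: np.ndarray,
                        token_dim: int = 64) -> np.ndarray:
    """Pack SMPL-X joint positions and camera parameters into fixed-size
    conditioning tokens (a hypothetical layout, for illustration only).

    joints_3d: (J, 3) array of 3D joint locations.
    camera:    (7,) array, e.g. position (3) + orientation quaternion (4).
    Returns:   (n_tokens, token_dim) array, zero-padded to a token boundary.
    """
    flat = np.concatenate([joints_3d.ravel(), camera.ravel()])
    n_tokens = int(np.ceil(flat.size / token_dim))
    tokens = np.zeros((n_tokens, token_dim), dtype=flat.dtype)
    tokens.reshape(-1)[: flat.size] = flat  # fill row-major, pad the tail
    return tokens

# Example: 55 SMPL-X joints plus a 7-DoF camera -> 172 values in 3 tokens.
tokens = encode_conditioning(np.zeros((55, 3)), np.zeros(7))
print(tokens.shape)  # (3, 64)
```

A real system would likely learn this projection (e.g., with an MLP) rather than flatten raw coordinates, but the interface, structured 3D parameters in, a token sequence out, is the relevant point.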

The overall pipeline is modular: any state‑of‑the‑art image generator or video diffusion model can be swapped in, making the approach future‑proof.
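
That modularity can be sketched as two swappable stages behind a common interface. Everything below (function names, array shapes, the moving-average stand-in for the video diffusion refiner) is illustrative, not the authors' code:

```python
import numpy as np
from typing import Callable, List

# Stage 1: any image generator mapping a conditioning vector to an RGB frame.
# Stage 2: any video refiner mapping a frame stack to a smoothed frame stack.
ImageGen = Callable[[np.ndarray], np.ndarray]
VideoRefiner = Callable[[np.ndarray], np.ndarray]

def generate_video(conds: List[np.ndarray], image_gen: ImageGen,
                   refiner: VideoRefiner) -> np.ndarray:
    """Image-first pipeline: per-frame synthesis, then one training-free
    temporal refinement pass over the stacked frames."""
    frames = np.stack([image_gen(c) for c in conds])  # (T, H, W, 3)
    return refiner(frames)                            # (T, H, W, 3)

# Dummy stand-ins: a noise "generator" and a temporal moving average
# playing the role of the video diffusion refiner.
rng = np.random.default_rng(0)
dummy_gen = lambda cond: rng.random((64, 64, 3))

def moving_average(frames: np.ndarray, k: int = 3) -> np.ndarray:
    pad = np.pad(frames, ((k // 2, k // 2), (0, 0), (0, 0), (0, 0)),
                 mode="edge")
    return np.stack([pad[t : t + k].mean(axis=0)
                     for t in range(frames.shape[0])])

video = generate_video([np.zeros(8)] * 5, dummy_gen, moving_average)
print(video.shape)  # (5, 64, 64, 3)
```

Because each stage only sees arrays at its boundary, either callable could be replaced by a stronger image or video diffusion model without touching the other, which is the "future-proof" property the paragraph above describes.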

Results & Findings

  • Visual Quality – The generated videos achieve FID scores of roughly 30, comparable to real video clips, while maintaining near‑4K resolution, a notable jump from prior methods that often cap out at 256–512 px.
  • Temporal Consistency – Measured by the Temporal Warping Error (TWE), the approach reduces flicker by ~45 % relative to baseline video‑GANs.
  • Control Fidelity – Ablation studies show that pose errors stay under 5 mm (in 3D space) and viewpoint deviations under 2°, confirming precise controllability.
  • User Study – In a blind test with 50 developers, 78 % preferred ReImagine videos over competing systems for realism and smoothness.
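
The exact Temporal Warping Error formulation is not reproduced in this summary; in the literature it typically warps the previous frame toward the current one via optical flow and measures the residual. A toy stand-in, using a known uniform integer flow instead of estimated optical flow, conveys the idea:

```python
import numpy as np

def warping_error(frames: np.ndarray, flow: tuple) -> float:
    """Mean absolute error between each frame and its predecessor warped by
    a (dy, dx) integer flow -- a simplified proxy for flow-based TWE."""
    dy, dx = flow
    errs = []
    for t in range(1, frames.shape[0]):
        warped = np.roll(frames[t - 1], shift=(dy, dx), axis=(0, 1))
        errs.append(np.abs(frames[t] - warped).mean())
    return float(np.mean(errs))

# A gradient pattern translating 2 px per frame scores zero under the
# correct flow and a larger error under the wrong (zero) flow.
base = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
clip = np.stack([np.roll(base, 2 * t, axis=1) for t in range(4)])
print(warping_error(clip, (0, 2)) < warping_error(clip, (0, 0)))  # True
```

A real TWE implementation would estimate per-pixel flow and mask occlusions; the point here is just that lower warped residuals mean less flicker, which is the axis on which the reported ~45 % improvement is measured.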

Practical Implications

  • Virtual Production & Gaming – Studios can generate high‑quality character animations on the fly, reducing the need for costly motion‑capture sessions.
  • AR/VR Avatars – Real‑time pose updates (e.g., from a webcam) can be fed into the pipeline to render lifelike avatars that maintain visual fidelity across head‑mounted displays.
  • E‑commerce & Fashion – Brands can showcase garments on a virtual model from any angle or pose without filming multiple takes, accelerating catalog creation.
  • Content Creation Tools – Plug‑ins for Unity/Unreal or video‑editing suites could expose “pose‑to‑video” controls, empowering creators without deep ML expertise.
  • Research Acceleration – The released canonical dataset and compositional model provide a solid baseline for further work on controllable human synthesis, domain adaptation, or personalized avatar generation.

Limitations & Future Work

  • Dependence on SMPL‑X Accuracy – Errors in the underlying 3D mesh (e.g., for loose clothing or accessories) propagate to the final video, limiting fidelity for highly non‑rigid outfits.
  • Computational Cost – Running two diffusion models sequentially (image then video) is still GPU‑intensive; real‑time deployment will require model distillation or lighter alternatives.
  • Limited Multi‑Person Scenarios – The current pipeline focuses on a single subject; extending to interactions or crowd scenes remains an open challenge.
  • Future Directions – The authors suggest integrating physics‑based cloth simulators, exploring low‑latency diffusion variants, and expanding the dataset to cover diverse body types and cultural attire.

Authors

  • Zhengwentai Sun
  • Keru Zheng
  • Chenghong Li
  • Hongjie Liao
  • Xihe Yang
  • Heyuan Li
  • Yihao Zhi
  • Shuliang Ning
  • Shuguang Cui
  • Xiaoguang Han

Paper Information

  • arXiv ID: 2604.19720v1
  • Categories: cs.CV
  • Published: April 21, 2026