[Paper] Human Video Generation from a Single Image with 3D Pose and View Control

Published: February 24, 2026 at 01:42 PM EST
4 min read
Source: arXiv - 2602.21188v1

Overview

The paper introduces Human Video Generation in 4D (HVG), a diffusion‑based model that can turn a single photograph of a person into a fully controllable, multi‑view video. By feeding the system a 3‑D pose and a desired camera angle, developers can synthesize realistic human motion—including clothing wrinkles that stay consistent across viewpoints—without needing any video footage as training data.

Key Contributions

  • Articulated Pose Modulation – a dual‑dimensional bone map that encodes 3‑D joint relationships, enabling the model to reason about self‑occlusions and maintain anatomical correctness across views.
  • View & Temporal Alignment – a synchronization scheme that ties the generated frames to both the reference image and the pose sequence, guaranteeing frame‑to‑frame stability and multi‑view consistency.
  • Progressive Spatio‑Temporal Sampling – a coarse‑to‑fine diffusion schedule that respects temporal alignment, producing smooth, long‑duration animations without flickering or jitter.
  • Latent Video Diffusion Architecture – operates in a compact latent space, making the generation process computationally tractable for high‑resolution human videos.
  • Extensive Benchmarks – quantitative and qualitative comparisons show HVG outperforms prior image‑to‑video and 3‑D human synthesis methods on realism, pose fidelity, and view consistency.
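To make the "dual-dimensional bone map" idea concrete, here is a minimal sketch of how a 2-D pixel map of per-bone direction and length could be rasterized from projected 3-D joints, kept alongside the raw 3-D joint positions. The skeleton layout, channel choices (dx, dy, length), and map size are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Toy skeleton: parent-child joint index pairs.
BONES = [(0, 1), (1, 2)]

def bone_map_2d(joints_3d, K, H=64, W=64):
    """Project 3-D joints with intrinsics K and rasterize each bone as a
    line whose pixels store (dx, dy, length) of that bone in 2-D."""
    proj = (K @ joints_3d.T).T          # perspective projection
    pts = proj[:, :2] / proj[:, 2:3]    # divide by depth
    bmap = np.zeros((H, W, 3), dtype=np.float32)
    for a, b in BONES:
        p, q = pts[a], pts[b]
        d = q - p
        length = np.linalg.norm(d)
        if length == 0:
            continue
        direction = d / length
        # Sample points along the bone segment and splat into the map.
        for t in np.linspace(0.0, 1.0, int(length) + 2):
            x, y = (p + t * d).astype(int)
            if 0 <= x < W and 0 <= y < H:
                bmap[y, x] = (direction[0], direction[1], length)
    return bmap

K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
joints = np.array([[0.0, 0.0, 2.0], [0.1, 0.3, 2.0], [0.2, 0.6, 2.1]])
m2d = bone_map_2d(joints, K)                 # the 2-D half of the dual map
dual = {"map_2d": m2d, "joints_3d": joints}  # 3-D half kept alongside
```

The key design point is that the 2-D map tells the denoiser what the skeleton looks like from the current view, while the retained 3-D joints let it reason about depth ordering and self-occlusion.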

Methodology

  1. Input Representation

    • Single RGB image of a person.
    • 3‑D pose skeleton (joint coordinates) supplied by a separate pose estimator.
    • Desired camera view (e.g., front, side, top).
  2. Dual‑Dimensional Bone Map

    • Constructs a 2‑D map (pixel‑wise) that records the direction and length of each bone, and a parallel 3‑D map that stores the actual joint positions.
    • This dual view lets the diffusion model understand how limbs should appear from any angle and how they occlude each other.
  3. Latent Diffusion Process

    • The image and bone maps are encoded into a low‑dimensional latent space using a pretrained VAE.
    • A UNet‑style denoiser iteratively refines a noisy latent video conditioned on the pose sequence and view parameters.
  4. Alignment Modules

    • View Alignment: Projects the 3‑D skeleton onto the target view and aligns the latent frames to keep the same person silhouette across angles.
    • Temporal Alignment: Enforces that consecutive latent frames follow the same motion trajectory, reducing temporal drift.
  5. Progressive Sampling

    • Starts with a short, low‑resolution clip to capture coarse motion, then progressively upsamples both spatially and temporally while preserving the alignment constraints.
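The coarse-to-fine schedule in step 5 can be sketched as follows. The denoiser here is a stand-in (the real model is a conditional UNet in latent space), and the stage sizes and step counts are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(latent, steps):
    """Placeholder for the conditional diffusion denoiser: each call
    stands in for one reverse-diffusion refinement pass."""
    for _ in range(steps):
        latent = 0.9 * latent
    return latent

def upsample(latent, t_scale, s_scale):
    """Nearest-neighbour upsampling along time (axis 0) and space (1, 2)."""
    latent = np.repeat(latent, t_scale, axis=0)
    latent = np.repeat(latent, s_scale, axis=1)
    latent = np.repeat(latent, s_scale, axis=2)
    return latent

# Stage 1: coarse motion -- 4 frames at 8x8 latent resolution.
clip = rng.standard_normal((4, 8, 8))
clip = denoise(clip, steps=20)

# Stage 2: double the frame rate and spatial size, refine further.
clip = denoise(upsample(clip, t_scale=2, s_scale=2), steps=10)

# Stage 3: final refinement at the full schedule resolution.
clip = denoise(upsample(clip, t_scale=2, s_scale=2), steps=5)
print(clip.shape)  # (16, 32, 32)
```

Because each stage inherits its initialization from the previous one, coarse motion decided early is preserved while later stages only add spatial and temporal detail, which is what suppresses flicker.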

Results & Findings

  • Visual Quality: HVG generates videos with sharp details, realistic cloth dynamics, and consistent lighting across viewpoints—far surpassing baselines like Imagen‑Video and Make‑It‑3D.
  • Pose Fidelity: Measured by MPJPE (Mean Per Joint Position Error), HVG reduces error by ~30 % compared to prior diffusion methods, indicating tighter adherence to the supplied 3‑D pose.
  • Multi‑View Consistency: A novel view‑consistency metric shows a 45 % improvement, confirming that the same motion looks coherent from different camera angles.
  • Temporal Smoothness: Temporal warping error drops dramatically, evidencing the effectiveness of progressive spatio‑temporal sampling.
  • Speed: Operating in latent space yields generation times of ~2–3 seconds per 2‑second clip on an RTX 3090, making it feasible for interactive prototyping.
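For reference, the MPJPE metric cited above is the standard mean Euclidean distance between predicted and ground-truth 3-D joints, averaged over joints and frames; a minimal implementation:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error.
    pred, gt: arrays of shape (frames, joints, 3); returns a scalar."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

gt = np.zeros((2, 3, 3))
pred = gt + np.array([3.0, 4.0, 0.0])  # every joint offset by 5 units
print(mpjpe(pred, gt))  # 5.0
```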

Practical Implications

  • Game Development & Virtual Production: Artists can quickly prototype character animations from concept art, adjusting pose and camera on the fly without hand‑animating every frame.
  • AR/VR Avatars: Real‑time generation of personalized avatars that react to user‑provided poses and viewpoints, enabling more immersive telepresence.
  • Fashion & E‑Commerce: Brands can showcase garments on a model from any angle and in motion using just a product photo, reducing the need for costly video shoots.
  • Content Creation Tools: Integration into video editors or AI‑assisted animation suites could let creators generate filler footage or background crowds automatically.
  • Research & Simulation: Provides a data‑efficient way to synthesize large volumes of labeled human motion for training other vision models (e.g., action recognition, pose estimation).

Limitations & Future Work

  • Pose Estimation Dependency: The quality of the output hinges on the accuracy of the supplied 3‑D pose; noisy or ambiguous poses can lead to artifacts.
  • Limited Interaction Modeling: Current design handles a single isolated human; extending to multi‑person scenes or object interactions remains an open challenge.
  • Resolution Scaling: While latent diffusion is efficient, generating ultra‑high‑resolution (4K+) videos still requires substantial GPU memory.
  • Generalization to Exotic Clothing: Very loose or highly reflective garments sometimes produce unrealistic wrinkles; future work could incorporate physics‑based cloth simulators as an additional conditioning signal.

Overall, HVG pushes the frontier of single‑image human video synthesis, offering developers a powerful new tool for creating controllable, multi‑view animations with surprisingly little input.

Authors

  • Tiantian Wang
  • Chun-Han Yao
  • Tao Hu
  • Mallikarjun Byrasandra Ramalinga Reddy
  • Ming-Hsuan Yang
  • Varun Jampani

Paper Information

  • arXiv ID: 2602.21188v1
  • Categories: cs.CV
  • Published: February 24, 2026
