[Paper] Human Video Generation from a Single Image with 3D Pose and View Control
Source: arXiv - 2602.21188v1
Overview
The paper introduces Human Video Generation in 4D (HVG), a diffusion‑based model that can turn a single photograph of a person into a fully controllable, multi‑view video. By feeding the system a 3‑D pose and a desired camera angle, developers can synthesize realistic human motion—including clothing wrinkles that stay consistent across viewpoints—without needing any video footage as training data.
Key Contributions
- Articulated Pose Modulation – a dual‑dimensional bone map that encodes 3‑D joint relationships, enabling the model to reason about self‑occlusions and maintain anatomical correctness across views.
- View & Temporal Alignment – a synchronization scheme that ties the generated frames to both the reference image and the pose sequence, guaranteeing frame‑to‑frame stability and multi‑view consistency.
- Progressive Spatio‑Temporal Sampling – a coarse‑to‑fine diffusion schedule that respects temporal alignment, producing smooth, long‑duration animations without flickering or jitter.
- Latent Video Diffusion Architecture – operates in a compact latent space, making the generation process computationally tractable for high‑resolution human videos.
- Extensive Benchmarks – quantitative and qualitative comparisons show HVG outperforms prior image‑to‑video and 3‑D human synthesis methods on realism, pose fidelity, and view consistency.
Methodology
Input Representation
- Single RGB image of a person.
- 3‑D pose skeleton (joint coordinates) supplied by a separate pose estimator.
- Desired camera view (e.g., front, side, top).
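The three inputs above can be sketched as a single container with basic validity checks. This is a minimal, hypothetical sketch; the class name, joint count, and view parameterization are illustrative assumptions, not taken from the paper's released code.

```python
from dataclasses import dataclass

# Hypothetical container for HVG's three inputs (names are assumptions).
@dataclass
class HVGInput:
    image: list       # H x W x 3 RGB reference image (nested lists here)
    joints_3d: list   # N x 3 joint coordinates from an external pose estimator
    view: tuple       # (azimuth_deg, elevation_deg) target camera angle

def validate(inp: HVGInput, n_joints: int = 24) -> bool:
    """Basic sanity checks before conditioning the generator."""
    if len(inp.joints_3d) != n_joints:
        return False
    if not all(len(j) == 3 for j in inp.joints_3d):
        return False
    az, el = inp.view
    return -180.0 <= az <= 180.0 and -90.0 <= el <= 90.0
```

The 24-joint default mirrors common SMPL-style skeletons, but the paper may use a different joint set.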
Dual‑Dimensional Bone Map
- Constructs a 2‑D map (pixel‑wise) that records the direction and length of each bone, and a parallel 3‑D map that stores the actual joint positions.
- This dual view lets the diffusion model understand how limbs should appear from any angle and how they occlude each other.
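A rough sketch of the dual representation, assuming the skeleton is given as (parent, child) index pairs. The bone list and feature layout here are illustrative; the paper rasterizes these quantities into pixel-wise maps rather than per-bone tuples.

```python
import math

# Toy skeleton: a hip -> knee -> ankle -> toe chain (an assumption).
BONES = [(0, 1), (1, 2), (2, 3)]

def bone_map_2d(joints_2d):
    """For each bone, record its unit direction (dx, dy) and pixel length."""
    feats = []
    for p, c in BONES:
        dx = joints_2d[c][0] - joints_2d[p][0]
        dy = joints_2d[c][1] - joints_2d[p][1]
        length = math.hypot(dx, dy)
        if length > 0:
            dx, dy = dx / length, dy / length
        feats.append((dx, dy, length))
    return feats

def bone_map_3d(joints_3d):
    """The parallel 3-D map keeps the actual endpoint positions per bone."""
    return [(joints_3d[p], joints_3d[c]) for p, c in BONES]
```

Keeping both maps lets a downstream model compare on-screen bone geometry with true 3-D positions, which is what makes occlusion reasoning possible.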
Latent Diffusion Process
- The image and bone maps are encoded into a low‑dimensional latent space using a pretrained VAE.
- A UNet‑style denoiser iteratively refines a noisy latent video conditioned on the pose sequence and view parameters.
Alignment Modules
- View Alignment: Projects the 3‑D skeleton onto the target view and aligns the latent frames to keep the same person silhouette across angles.
- Temporal Alignment: Enforces that consecutive latent frames follow the same motion trajectory, reducing temporal drift.
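The geometric side of these two modules can be sketched as follows: projecting the skeleton into the target view, and measuring drift between consecutive latent frames. The paper's alignment modules are learned; this only illustrates the underlying signals, with an orthographic camera as a simplifying assumption.

```python
import math

def project_to_view(joints_3d, azimuth_deg):
    """Rotate the skeleton about the vertical axis, then drop depth
    (orthographic projection, an assumption for illustration)."""
    a = math.radians(azimuth_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for x, y, z in joints_3d:
        xr = cos_a * x + sin_a * z   # rotate around the y axis
        out.append((xr, y))          # keep the image-plane coordinates
    return out

def temporal_drift(frames):
    """Mean squared difference between consecutive latent frames;
    temporal alignment aims to keep this small."""
    total, n = 0.0, 0
    for f0, f1 in zip(frames, frames[1:]):
        total += sum((a - b) ** 2 for a, b in zip(f0, f1))
        n += len(f0)
    return total / max(n, 1)
```

A point on the depth axis maps onto the image plane after a 90° azimuth rotation, and identical frames give zero drift.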
Progressive Sampling
- Starts with a short, low‑resolution clip to capture coarse motion, then progressively upsamples both spatially and temporally while preserving the alignment constraints.
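The coarse-to-fine idea can be sketched as a stage plan plus a simple temporal upsampler. The stage counts, base resolution, and midpoint interpolation are illustrative assumptions; the paper performs the refinement inside the diffusion process rather than by direct interpolation.

```python
def progressive_schedule(base_frames=8, base_res=64, stages=3):
    """Double the frame count and resolution at each stage
    (stage counts are assumptions, not the paper's values)."""
    plan = []
    frames, res = base_frames, base_res
    for _ in range(stages):
        plan.append((frames, res))
        frames *= 2
        res *= 2
    return plan

def temporal_upsample(frames):
    """Insert midpoint frames so a clip of n frames becomes 2n - 1."""
    out = [frames[0]]
    for f0, f1 in zip(frames, frames[1:]):
        out.append([(a + b) / 2 for a, b in zip(f0, f1)])
        out.append(f1)
    return out
```

Because each stage only refines what the previous one produced, the coarse motion fixed early on constrains later stages, which is what suppresses flicker.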
Results & Findings
- Visual Quality: HVG generates videos with sharp details, realistic cloth dynamics, and consistent lighting across viewpoints—far surpassing baselines like Imagen‑Video and Make‑It‑3D.
- Pose Fidelity: Measured by MPJPE (Mean Per Joint Position Error), HVG reduces error by ~30 % compared to prior diffusion methods, indicating tighter adherence to the supplied 3‑D pose.
- Multi‑View Consistency: A novel view‑consistency metric shows a 45 % improvement, confirming that the same motion looks coherent from different camera angles.
- Temporal Smoothness: Temporal warping error drops dramatically, evidencing the effectiveness of progressive spatio‑temporal sampling.
- Speed: Operating in latent space yields generation times of ~2–3 seconds per 2‑second clip on an RTX 3090, making it feasible for interactive prototyping.
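The MPJPE metric cited above is standard and easy to reproduce: the mean Euclidean distance between predicted and ground-truth joints. This sketch covers a single frame; in practice the mean is also taken over frames.

```python
import math

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance
    between corresponding predicted and ground-truth 3-D joints."""
    assert len(pred) == len(gt), "joint lists must align"
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)
```

A lower MPJPE means the generated body pose tracks the supplied 3-D skeleton more tightly, which is the sense in which HVG's ~30% reduction is reported.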
Practical Implications
- Game Development & Virtual Production: Artists can quickly prototype character animations from concept art, adjusting pose and camera on the fly without hand‑animating every frame.
- AR/VR Avatars: Real‑time generation of personalized avatars that react to user‑provided poses and viewpoints, enabling more immersive telepresence.
- Fashion & E‑Commerce: Brands can showcase garments on a model from any angle and in motion using just a product photo, reducing the need for costly video shoots.
- Content Creation Tools: Integration into video editors or AI‑assisted animation suites could let creators generate filler footage or background crowds automatically.
- Research & Simulation: Provides a data‑efficient way to synthesize large volumes of labeled human motion for training other vision models (e.g., action recognition, pose estimation).
Limitations & Future Work
- Pose Estimation Dependency: The quality of the output hinges on the accuracy of the supplied 3‑D pose; noisy or ambiguous poses can lead to artifacts.
- Limited Interaction Modeling: Current design handles a single isolated human; extending to multi‑person scenes or object interactions remains an open challenge.
- Resolution Scaling: While latent diffusion is efficient, generating ultra‑high‑resolution (4K+) videos still requires substantial GPU memory.
- Generalization to Exotic Clothing: Very loose or highly reflective garments sometimes produce unrealistic wrinkles; future work could incorporate physics‑based cloth simulators as an additional conditioning signal.
Overall, HVG pushes the frontier of single‑image human video synthesis, offering developers a powerful new tool for creating controllable, multi‑view animations with surprisingly little input.
Authors
- Tiantian Wang
- Chun-Han Yao
- Tao Hu
- Mallikarjun Byrasandra Ramalinga Reddy
- Ming-Hsuan Yang
- Varun Jampani
Paper Information
- arXiv ID: 2602.21188v1
- Categories: cs.CV
- Published: February 24, 2026