[Paper] Human Video Generation from a Single Image with 3D Pose and View Control
Source: arXiv - 2602.21188v1
Overview
The paper introduces Human Video Generation in 4D (HVG), a diffusion‑based model that can turn a single photograph of a person into a fully controllable, multi‑view video. By feeding the system a 3‑D pose and a desired camera angle, developers can synthesize realistic human motion—including clothing wrinkles that stay consistent across viewpoints—without needing any video footage as training data.
Key Contributions
- Articulated Pose Modulation – a dual‑dimensional bone map that encodes 3‑D joint relationships, enabling the model to reason about self‑occlusions and maintain anatomical correctness across views.
- View & Temporal Alignment – a synchronization scheme that ties the generated frames to both the reference image and the pose sequence, guaranteeing frame‑to‑frame stability and multi‑view consistency.
- Progressive Spatio‑Temporal Sampling – a coarse‑to‑fine diffusion schedule that respects temporal alignment, producing smooth, long‑duration animations without flickering or jitter.
- Latent Video Diffusion Architecture – operates in a compact latent space, making the generation process computationally tractable for high‑resolution human videos.
- Extensive Benchmarks – quantitative and qualitative comparisons show HVG outperforms prior image‑to‑video and 3‑D human synthesis methods on realism, pose fidelity, and view consistency.
Methodology
Input Representation
- Single RGB image of a person.
- 3‑D pose skeleton (joint coordinates) supplied by a separate pose estimator.
- Desired camera view (e.g., front, side, top).
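The three inputs above can be sketched as a single container with basic validity checks. This is a minimal, hypothetical sketch; the class name, joint count, and view parameterization are illustrative assumptions, not taken from the paper's released code.

```python
from dataclasses import dataclass

# Hypothetical container for HVG's three inputs (names are assumptions).
@dataclass
class HVGInput:
    image: list       # H x W x 3 RGB reference image (nested lists here)
    joints_3d: list   # N x 3 joint coordinates from an external pose estimator
    view: tuple       # (azimuth_deg, elevation_deg) target camera angle

def validate(inp: HVGInput, n_joints: int = 24) -> bool:
    """Basic sanity checks before conditioning the generator."""
    if len(inp.joints_3d) != n_joints:
        return False
    if not all(len(j) == 3 for j in inp.joints_3d):
        return False
    az, el = inp.view
    return -180.0 <= az <= 180.0 and -90.0 <= el <= 90.0
```

The 24-joint default mirrors common SMPL-style skeletons, but the paper may use a different joint set.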
Dual‑Dimensional Bone Map
- Constructs a 2‑D map (pixel‑wise) that records the direction and length of each bone, and a parallel 3‑D map that stores the actual joint positions.
- This dual view lets the diffusion model understand how limbs should appear from any angle and how they occlude each other.
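A rough sketch of the dual representation, assuming the skeleton is given as (parent, child) index pairs. The bone list and feature layout here are illustrative; the paper rasterizes these quantities into pixel-wise maps rather than per-bone tuples.

```python
import math

# Toy skeleton: a hip -> knee -> ankle -> toe chain (an assumption).
BONES = [(0, 1), (1, 2), (2, 3)]

def bone_map_2d(joints_2d):
    """For each bone, record its unit direction (dx, dy) and pixel length."""
    feats = []
    for p, c in BONES:
        dx = joints_2d[c][0] - joints_2d[p][0]
        dy = joints_2d[c][1] - joints_2d[p][1]
        length = math.hypot(dx, dy)
        if length > 0:
            dx, dy = dx / length, dy / length
        feats.append((dx, dy, length))
    return feats

def bone_map_3d(joints_3d):
    """The parallel 3-D map keeps the actual endpoint positions per bone."""
    return [(joints_3d[p], joints_3d[c]) for p, c in BONES]
```

Keeping both maps lets a downstream model compare on-screen bone geometry with true 3-D positions, which is what makes occlusion reasoning possible.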
Latent Diffusion Process
- The image and bone maps are encoded into a low‑dimensional latent space using a pretrained VAE.
- A UNet‑style denoiser iteratively refines a noisy latent video conditioned on the pose sequence and view parameters.
Alignment Modules
- View Alignment: Projects the 3‑D skeleton onto the target view and aligns the latent frames to keep the same person silhouette across angles.
- Temporal Alignment: Enforces that consecutive latent frames follow the same motion trajectory, reducing temporal drift.
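The geometric side of these two modules can be sketched as follows: projecting the skeleton into the target view, and measuring drift between consecutive latent frames. The paper's alignment modules are learned; this only illustrates the underlying signals, with an orthographic camera as a simplifying assumption.

```python
import math

def project_to_view(joints_3d, azimuth_deg):
    """Rotate the skeleton about the vertical axis, then drop depth
    (orthographic projection, an assumption for illustration)."""
    a = math.radians(azimuth_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for x, y, z in joints_3d:
        xr = cos_a * x + sin_a * z   # rotate around the y axis
        out.append((xr, y))          # keep the image-plane coordinates
    return out

def temporal_drift(frames):
    """Mean squared difference between consecutive latent frames;
    temporal alignment aims to keep this small."""
    total, n = 0.0, 0
    for f0, f1 in zip(frames, frames[1:]):
        total += sum((a - b) ** 2 for a, b in zip(f0, f1))
        n += len(f0)
    return total / max(n, 1)
```

A point on the depth axis maps onto the image plane after a 90° azimuth rotation, and identical frames give zero drift.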
Progressive Sampling
- Starts with a short, low‑resolution clip to capture coarse motion, then progressively upsamples both spatially and temporally while preserving the alignment constraints.
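The coarse-to-fine idea can be sketched as a stage plan plus a simple temporal upsampler. The stage counts, base resolution, and midpoint interpolation are illustrative assumptions; the paper performs the refinement inside the diffusion process rather than by direct interpolation.

```python
def progressive_schedule(base_frames=8, base_res=64, stages=3):
    """Double the frame count and resolution at each stage
    (stage counts are assumptions, not the paper's values)."""
    plan = []
    frames, res = base_frames, base_res
    for _ in range(stages):
        plan.append((frames, res))
        frames *= 2
        res *= 2
    return plan

def temporal_upsample(frames):
    """Insert midpoint frames so a clip of n frames becomes 2n - 1."""
    out = [frames[0]]
    for f0, f1 in zip(frames, frames[1:]):
        out.append([(a + b) / 2 for a, b in zip(f0, f1)])
        out.append(f1)
    return out
```

Because each stage only refines what the previous one produced, the coarse motion fixed early on constrains later stages, which is what suppresses flicker.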
Results & Findings
- Visual Quality: HVG generates videos with sharp details, realistic cloth dynamics, and consistent lighting across viewpoints—far surpassing baselines like Imagen‑Video and Make‑It‑3D.
- Pose Fidelity: Measured by MPJPE (Mean Per Joint Position Error), HVG reduces error by ~30 % compared to prior diffusion methods, indicating tighter adherence to the supplied 3‑D pose.
- Multi‑View Consistency: A novel view‑consistency metric shows a 45 % improvement, confirming that the same motion looks coherent from different camera angles.
- Temporal Smoothness: Temporal warping error drops dramatically, evidencing the effectiveness of progressive spatio‑temporal sampling.
- Speed: Operating in latent space yields generation times of ~2–3 seconds per 2‑second clip on an RTX 3090, making it feasible for interactive prototyping.
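The MPJPE metric cited above is standard and easy to reproduce: the mean Euclidean distance between predicted and ground-truth joints. This sketch covers a single frame; in practice the mean is also taken over frames.

```python
import math

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance
    between corresponding predicted and ground-truth 3-D joints."""
    assert len(pred) == len(gt), "joint lists must align"
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)
```

A lower MPJPE means the generated body pose tracks the supplied 3-D skeleton more tightly, which is the sense in which HVG's ~30% reduction is reported.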
Practical Implications
- Game Development & Virtual Production: Artists can quickly prototype character animations from concept art, adjusting pose and camera on the fly without hand‑animating every frame.
- AR/VR Avatars: Real‑time generation of personalized avatars that react to user‑provided poses and viewpoints, enabling more immersive telepresence.
- Fashion & E‑Commerce: Brands can showcase garments on a model from any angle and in motion using just a product photo, reducing the need for costly video shoots.
- Content Creation Tools: Integration into video editors or AI‑assisted animation suites could let creators generate filler footage or background crowds automatically.
- Research & Simulation: Provides a data‑efficient way to synthesize large volumes of labeled human motion for training other vision models (e.g., action recognition, pose estimation).
Limitations & Future Work
- Pose Estimation Dependency: The quality of the output hinges on the accuracy of the supplied 3‑D pose; noisy or ambiguous poses can lead to artifacts.
- Limited Interaction Modeling: Current design handles a single isolated human; extending to multi‑person scenes or object interactions remains an open challenge.
- Resolution Scaling: While latent diffusion is efficient, generating ultra‑high‑resolution (4K+) videos still requires substantial GPU memory.
- Generalization to Exotic Clothing: Very loose or highly reflective garments sometimes produce unrealistic wrinkles; future work could incorporate physics‑based cloth simulators as an additional conditioning signal.
Overall, HVG pushes the frontier of single‑image human video synthesis, offering developers a powerful new tool for creating controllable, multi‑view animations with surprisingly little input.
Authors
- Tiantian Wang
- Chun-Han Yao
- Tao Hu
- Mallikarjun Byrasandra Ramalinga Reddy
- Ming-Hsuan Yang
- Varun Jampani
Paper Information
- arXiv ID: 2602.21188v1
- Categories: cs.CV
- Published: February 24, 2026