[Paper] 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Published: February 3, 2026
4 min read
Source: arXiv - 2602.03796v1

Overview

The paper presents 3DiMo, a new framework for generating human videos that can be driven by motion cues while freely changing the camera viewpoint. Instead of relying on brittle 2‑D pose tracks or heavyweight 3‑D body models (e.g., SMPL), 3DiMo learns an implicit, view‑agnostic motion representation that plugs directly into a pre‑trained video generator. This lets developers synthesize high‑fidelity, view‑adaptive human clips with just a few motion frames and optional text‑based camera commands.

Key Contributions

  • Implicit motion tokens: Introduces a compact, view‑agnostic motion encoding that is injected into a video generator via cross‑attention, avoiding explicit 3‑D reconstruction at inference time.
  • Joint encoder‑generator training: Simultaneously trains a motion encoder with a frozen, large‑scale video generator, letting the encoder inherit the generator’s spatial priors.
  • View‑rich supervision: Uses a mix of single‑view, multi‑view, and moving‑camera video data to enforce motion consistency across viewpoints.
  • Annealed SMPL guidance: Starts training with SMPL‑based geometric cues for stability, then gradually removes them so the model learns genuine 3‑D motion understanding from data.
  • Text‑driven camera control: Adds a lightweight text interface that lets users specify camera moves (e.g., “pan left 30°”) while preserving the original motion.
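The token-injection idea behind the first contribution can be sketched compactly: the generator's spatial features act as queries that attend over the motion tokens. This is an illustrative NumPy toy, not the authors' implementation; the token dimension, projection matrices, and residual form are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(frame_feats, motion_tokens, Wq, Wk, Wv):
    """Inject motion tokens into generator features via cross-attention.

    frame_feats:   (N, d) spatial features from the (frozen) video generator
    motion_tokens: (M, d) view-agnostic tokens from the motion encoder
    Returns features of shape (N, d), modulated by the motion cues.
    """
    q = frame_feats @ Wq          # queries come from the generator
    k = motion_tokens @ Wk        # keys/values come from the motion encoder
    v = motion_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return frame_feats + attn @ v  # residual injection keeps generator priors
```

Because the generator stays frozen, gradients flowing through this layer shape only the encoder's tokens, which is how the encoder inherits the generator's spatial priors.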

Methodology

  1. Motion Encoder – A lightweight transformer processes a short clip of driving frames and outputs a set of motion tokens. These tokens are deliberately designed to be view‑agnostic: they capture the underlying body dynamics without encoding a specific camera angle.
  2. Cross‑Attention Injection – The tokens are fed into a pre‑trained video generator (e.g., a diffusion or autoregressive model) through cross‑attention layers. This lets the generator modulate its spatial features according to the motion cues while preserving its learned 3‑D priors.
  3. Training Regime
    • Multi‑view data: The same motion is presented from different camera angles, forcing the encoder to produce identical tokens regardless of viewpoint.
    • Moving‑camera videos: Camera motion is explicitly varied, teaching the system to separate body motion from camera motion.
    • Auxiliary SMPL loss: Early in training, a temporary SMPL‑based loss aligns the tokens with a known 3‑D skeleton. The weight of this loss is linearly annealed to zero, after which the model relies solely on implicit cues.
  4. Text‑based Camera Prompt – A small language model maps natural‑language camera commands to latent camera parameters that are added to the generator’s conditioning, enabling on‑the‑fly viewpoint changes.
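The annealed SMPL guidance in step 3 amounts to a simple schedule on the auxiliary loss weight. A minimal sketch, assuming a linear ramp; the paper only states the weight is linearly annealed to zero, so the initial weight and step count here are placeholders:

```python
def smpl_loss_weight(step: int, anneal_steps: int, w0: float = 1.0) -> float:
    """Weight of the auxiliary SMPL loss, linearly annealed to zero.

    After `anneal_steps` optimizer steps the weight stays at 0, so training
    continues on the implicit motion objective alone.
    """
    return max(0.0, w0 * (1.0 - step / anneal_steps))

# Schematic per-step objective (names hypothetical):
# loss = implicit_motion_loss + smpl_loss_weight(step, 10_000) * smpl_loss
```

The early SMPL term stabilizes training by anchoring the tokens to a known 3-D skeleton; annealing it away forces the model to keep that 3-D structure without explicit supervision.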

Results & Findings

  • Motion fidelity: 3DiMo reduces pose error (MPJPE) by 15 % relative to the strongest 2‑D‑pose baselines and by 8 % relative to SMPL‑based methods.
  • Visual quality: In user studies, participants rated 3DiMo videos 1.2× higher on realism and 1.4× higher on smoothness compared to prior work.
  • View adaptability: The system can render the same motion from arbitrary camera angles, including extreme rotations (±90°) that break conventional 2‑D‑pose pipelines.
  • Ablation: Removing the SMPL pre‑training phase degrades performance by ~6 % in motion accuracy, confirming its role as a useful “bootstrap”. Dropping multi‑view supervision leads to noticeable drift when the camera is changed at test time.
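For readers unfamiliar with the pose metric above, MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames. A minimal NumPy version; root alignment, which some evaluation protocols apply first, is omitted here:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error over (frames, joints, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```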

Practical Implications

  • Game & VR content creation: Artists can generate high‑quality human animations from a few reference clips and then freely reposition the virtual camera, cutting down on motion‑capture sessions.
  • Synthetic data pipelines: Researchers building training data for pose estimation or action recognition can now produce diverse viewpoint variations without expensive 3‑D reconstructions.
  • Live streaming & AR filters: Real‑time systems could embed the motion encoder to drive avatars that follow a user’s movements while the camera pans or zooms, all with low latency.
  • Text‑driven editing tools: Integrating the natural‑language camera prompt enables non‑technical users to script cinematic shots (“zoom in on the left hand”) directly in video generation interfaces.

Limitations & Future Work

  • Dependency on pre‑trained generators: The quality of 3DiMo is bounded by the underlying video generator; scaling to higher resolutions may require larger models.
  • Limited to single‑person motions: Current experiments focus on isolated humans; extending to multi‑person interactions or crowded scenes remains open.
  • Training data diversity: The view‑rich supervision relies on datasets that contain multi‑view or moving‑camera footage, which are still relatively scarce.
  • Future directions: The authors suggest exploring hierarchical motion tokens for longer sequences, integrating explicit physics constraints for more realistic dynamics, and applying the framework to other articulated objects (e.g., animals, robots).

Authors

  • Zhixue Fang
  • Xu He
  • Songlin Tang
  • Haoxian Zhang
  • Qingfeng Li
  • Xiaoqiang Liu
  • Pengfei Wan
  • Kun Gai

Paper Information

  • arXiv ID: 2602.03796v1
  • Categories: cs.CV
  • Published: February 3, 2026