[Paper] ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Published: May 7, 2026 at 01:59 PM EDT
Source: arXiv - 2605.06667v1

Overview

ActCam is a zero‑shot technique that lets you generate videos where both the actor’s motion and the camera’s movement are fully controllable, without any extra training. By leveraging existing image‑to‑video diffusion models, it can transplant a character’s pose sequence into a new scene while following a user‑specified camera trajectory, opening up new possibilities for creators, game developers, and AR/VR pipelines.

Key Contributions

  • Joint motion‑and‑camera control in a single diffusion pass, handling both intrinsic (zoom, focal length) and extrinsic (position, orientation) camera parameters.
  • Zero‑shot workflow: works with any pretrained diffusion model that accepts depth and pose conditioning—no fine‑tuning required.
  • Geometrically consistent conditioning: automatically generates per‑frame depth maps that stay coherent across the whole video, preserving scene structure.
  • Two‑phase conditioning schedule: early denoising steps use pose + sparse depth to lock in layout; later steps drop the depth condition so the model can add high‑frequency detail without being over‑constrained.
  • Extensive benchmark evaluation showing superior camera adherence and motion fidelity compared to pose‑only baselines and prior joint control methods.
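
To make the intrinsic/extrinsic distinction above concrete, here is a minimal sketch of a standard pinhole camera projection in plain Python. It is not taken from the paper: the function names (`project_point`, `project_pose`) and the parameter layout (`rotation`, `translation`, `fx`, `fy`, `cx`, `cy`) are assumptions chosen for illustration. It shows how one set of 3‑D joints could yield both a 2‑D pose and a few sparse depth values for a given camera.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project one 3-D world point to (u, v, depth) with a pinhole model.

    rotation (3x3 nested list) and translation (length 3) are the extrinsics;
    fx, fy (focal lengths in pixels) and cx, cy (principal point) are the
    intrinsics. All names here are illustrative assumptions.
    """
    # World -> camera coordinates: X_cam = R @ X_world + t
    x_cam, y_cam, z_cam = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z_cam <= 0:
        return None  # the point lies behind the camera
    # Camera -> pixel coordinates: perspective divide, then intrinsics
    u = fx * x_cam / z_cam + cx
    v = fy * y_cam / z_cam + cy
    return u, v, z_cam


def project_pose(joints_world, rotation, translation, fx, fy, cx, cy):
    """Project a list of 3-D joints; returns 2-D keypoints and per-joint depths."""
    keypoints, depths = [], []
    for joint in joints_world:
        projected = project_point(joint, rotation, translation, fx, fy, cx, cy)
        if projected is None:
            continue  # skip joints that are not in front of the camera
        u, v, depth = projected
        keypoints.append((u, v))
        depths.append(depth)
    return keypoints, depths


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    joints = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]  # made-up joint positions
    # Same extrinsics, two focal lengths: a longer focal length (zoom in)
    # pushes off-center joints further from the principal point.
    for focal in (100.0, 200.0):
        print(focal, project_pose(joints, identity, (0.0, 0.0, 0.0),
                                  focal, focal, 50.0, 50.0))
```

In this sketch, changing `rotation` and `translation` moves the camera (extrinsic control), while changing `fx` and `fy` zooms (intrinsic control); the joint positions themselves are untouched in both cases.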

Methodology

  1. Input preparation

    • A source video provides the actor’s pose trajectory (extracted with a pose estimator).
    • A target camera path is defined by a sequence of extrinsic/intrinsic parameters (e.g., a 3‑D spline or keyframes).
  2. Condition generation

    • For each frame, ActCam synthesizes a depth map that aligns the actor’s pose with the desired camera view, ensuring that the 3‑D layout stays consistent over time.
    • The depth maps are sparse (only a few depth cues) to keep the diffusion model flexible.
  3. Diffusion sampling with staged guidance

    • Phase 1 (coarse) – During the early denoising steps, the model receives both pose and depth conditions, anchoring the overall geometry and camera perspective.
    • Phase 2 (refinement) – In later steps, depth conditioning is removed; only the pose guides the generation, allowing the model to flesh out textures, lighting, and motion details without being locked to a rigid depth field.
  4. Output

    • The result is a video where the character follows the original motion, but the camera follows the user‑specified trajectory, all produced in a single sampling run of the diffusion model.
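
The staged guidance in step 3 can be sketched as a simple per‑step schedule. This is a minimal illustration, not the authors' implementation: `conditioning_schedule`, `sample`, the `denoise_step` callable, and the `depth_fraction` switch point are all hypothetical names and values, since the summary does not say at which step depth is dropped.

```python
def conditioning_schedule(num_steps, depth_fraction=0.4):
    """Return the active condition names for each denoising step.

    depth_fraction is a hypothetical knob: the share of early steps that
    receive depth in addition to pose. The value 0.4 is an arbitrary default.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    switch_step = round(num_steps * depth_fraction)
    return [
        ("pose", "depth") if step < switch_step else ("pose",)
        for step in range(num_steps)
    ]


def sample(latent, denoise_step, pose, depth, num_steps=10, depth_fraction=0.4):
    """Run a sampling loop that drops the depth condition after phase 1.

    denoise_step(latent, step, conditions) stands in for one call to a
    pretrained diffusion model; it is a placeholder, not a real library API.
    """
    schedule = conditioning_schedule(num_steps, depth_fraction)
    for step, active in enumerate(schedule):
        conditions = {"pose": pose}      # pose guides every step
        if "depth" in active:
            conditions["depth"] = depth  # phase 1 only: anchor the geometry
        latent = denoise_step(latent, step, conditions)
    return latent


if __name__ == "__main__":
    def fake_denoise_step(latent, step, conditions):
        # Toy stand-in that only reports which conditions it received.
        print(step, sorted(conditions))
        return latent

    sample(latent=0, denoise_step=fake_denoise_step, pose="pose-seq", depth="sparse-depth")
```

The point of keeping the schedule separate from the loop is that the switch point becomes a single tunable value: a larger `depth_fraction` constrains geometry for longer, a smaller one leaves the model more freedom earlier.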

Results & Findings

  • Quantitative evaluation (e.g., camera pose error, pose similarity) shows a 30‑40 % reduction in camera deviation relative to pose‑only baselines.
  • Motion fidelity (measured by keypoint distance) improves modestly, indicating that the added camera conditioning does not harm the actor’s dynamics.
  • Human preference studies: participants chose ActCam videos over competing methods 65 % of the time, especially when the camera underwent large viewpoint swings.
  • The approach remains robust across diverse character styles (humans, animals, stylized avatars) and challenging scene layouts (indoor clutter, outdoor depth variations).
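
For readers who want to compute comparable numbers on their own clips, the sketch below shows one plausible way to measure keypoint distance and to express an error drop as a relative reduction. The exact metric definitions used in the paper are not given in this summary, so the function names and formulas are assumptions, and the sample values are made up.

```python
import math


def mean_keypoint_distance(pred_frames, ref_frames):
    """Average Euclidean distance between matching 2-D keypoints over all frames.

    Each argument is a list of frames; each frame is a list of (x, y) tuples
    in the same keypoint order. This definition is an assumption.
    """
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred_frames, ref_frames):
        for pred_point, ref_point in zip(pred_frame, ref_frame):
            total += math.dist(pred_point, ref_point)
            count += 1
    if count == 0:
        raise ValueError("no keypoints to compare")
    return total / count


def relative_reduction(baseline_error, method_error):
    """Fraction by which method_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - method_error) / baseline_error


if __name__ == "__main__":
    # Made-up keypoints for a single frame, not data from the paper.
    generated = [[(10.0, 10.0), (23.0, 24.0)]]
    reference = [[(10.0, 10.0), (20.0, 20.0)]]
    print(mean_keypoint_distance(generated, reference))
    # Made-up camera errors: how much lower is the second than the first?
    print(relative_reduction(2.0, 1.3))
```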

Practical Implications

  • Content creation pipelines: filmmakers and game studios can script camera moves and reuse existing motion capture clips without re‑recording or retraining models.
  • AR/VR experiences: developers can generate on‑the‑fly cinematic replays that match a user’s head‑tracked viewpoint, keeping the virtual actor anchored correctly.
  • Rapid prototyping: product designers can visualize a prototype being used from any angle by simply providing a motion clip and a camera path, cutting down on costly motion‑capture sessions.
  • Education & training: instructors can demonstrate biomechanical motions from arbitrary viewpoints, aiding in sports analysis or medical training.

Limitations & Future Work

  • Depth sparsity: while sufficient for many scenes, extremely complex geometry may require denser depth cues or a dedicated depth predictor.
  • Dependence on pose extraction quality: noisy or missing keypoints can propagate errors into the generated video.
  • Real‑time generation: current diffusion sampling remains computationally heavy; accelerating inference (e.g., via distillation or latent‑space shortcuts) is an open avenue.
  • Broader modality support: extending the framework to handle audio‑driven lip sync or multi‑character interactions would further broaden its applicability.

ActCam demonstrates that with clever conditioning and staged guidance, we can achieve fine‑grained, joint control over motion and cinematography without the cost of training new models—an exciting step toward more flexible, creator‑friendly video synthesis.

Authors

  • Omar El Khalifi
  • Thomas Rossi
  • Oscar Fossey
  • Thibault Fouque
  • Ulysse Mizrahi
  • Philip Torr
  • Ivan Laptev
  • Fabio Pizzati
  • Baptiste Bellot-Gurlet

Paper Information

  • arXiv ID: 2605.06667v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: May 7, 2026