[Paper] ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Published: May 7, 2026 at 01:59 PM EDT

Source: arXiv - 2605.06667v1

Overview

ActCam is a zero‑shot technique that lets you generate videos where both the actor’s motion and the camera’s movement are fully controllable, without any extra training. By leveraging existing image‑to‑video diffusion models, it can transplant a character’s pose sequence into a new scene while following a user‑specified camera trajectory, opening up new possibilities for creators, game developers, and AR/VR pipelines.

Key Contributions

Joint motion‑and‑camera control in a single diffusion pass, handling both intrinsic (zoom, focal length) and extrinsic (position, orientation) camera parameters.
Zero‑shot workflow: works with any pretrained diffusion model that accepts depth and pose conditioning, with no fine‑tuning required.
Geometrically consistent conditioning: automatically generates per‑frame depth maps that stay coherent across the whole video, preserving scene structure.
Two‑phase conditioning schedule: early denoising steps use pose + sparse depth to lock in layout; later steps drop depth so the model can add high‑frequency detail without being over‑constrained.
Extensive benchmark evaluation showing superior camera adherence and motion fidelity compared to pose‑only baselines and prior joint control methods.
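
To make the intrinsic/extrinsic distinction concrete, here is a minimal pinhole‑camera sketch in plain Python. It is not taken from the paper: the function name `project_point`, the parameter layout, and the example numbers are illustrative assumptions. It projects a 3D world point to pixel coordinates and also returns its depth in the camera frame, which is the kind of value a sparse depth cue could carry.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_point(
    point_world: Vec3,
    rotation: Sequence[Sequence[float]],   # 3x3 world-to-camera rotation (extrinsic)
    translation: Vec3,                     # world-to-camera translation (extrinsic)
    focal_px: float,                       # focal length in pixels (intrinsic)
    principal_point: Tuple[float, float],  # image center in pixels (intrinsic)
) -> Tuple[float, float, float]:
    """Return (u, v, depth) for a world point under a pinhole camera."""
    # Extrinsics: move the point from world coordinates to camera coordinates.
    x, y, z = (
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, then scale and shift into pixels.
    u = focal_px * x / z + principal_point[0]
    v = focal_px * y / z + principal_point[1]
    return u, v, z


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
CENTER = (320.0, 240.0)

# Hypothetical joint: 0.5 m right of the optical axis, 2 m in front of the camera.
joint = (0.5, 0.0, 2.0)

base = project_point(joint, IDENTITY, (0.0, 0.0, 0.0), 500.0, CENTER)
# Intrinsic change: doubling the focal length zooms in; depth is unchanged.
zoomed = project_point(joint, IDENTITY, (0.0, 0.0, 0.0), 1000.0, CENTER)
# Extrinsic change: moving the camera 2 m back doubles the depth.
pulled_back = project_point(joint, IDENTITY, (0.0, 0.0, 2.0), 500.0, CENTER)
```

In this toy setup the zoom pushes the joint away from the image center while its depth stays at 2 m, whereas pulling the camera back moves the joint toward the center and changes its depth to 4 m. That difference is why both parameter groups need to be specified for a camera trajectory.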

Methodology

Input preparation
- A source video provides the actor’s pose trajectory (extracted with a pose estimator).
- A target camera path is defined by a sequence of extrinsic/intrinsic parameters (e.g., a 3‑D spline or keyframes).
Condition generation
- For each frame, ActCam synthesizes a depth map that aligns the actor’s pose with the desired camera view, ensuring that the 3‑D layout stays consistent over time.
- The depth maps are sparse (only a few depth cues) to keep the diffusion model flexible.
Diffusion sampling with staged guidance
- Phase 1 (coarse) – During the early denoising steps, the model receives both pose and depth conditions, anchoring the overall geometry and camera perspective.
- Phase 2 (refinement) – In later steps, depth conditioning is removed; only the pose guides the generation, allowing the model to flesh out textures, lighting, and motion details without being locked to a rigid depth field.
Output
- The result is a video where the character follows the original motion but the camera follows the user‑specified trajectory, all produced in a single sampling run of the diffusion model.
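
The staged guidance described above can be sketched as a per‑step choice of which conditions reach the denoiser. This is a minimal illustration rather than the paper's implementation: `conditions_for_step`, `build_schedule`, the `depth_fraction` switch point of 0.5, and the step count are all hypothetical, since this summary does not state where the switch between phases happens.

```python
from typing import List, Tuple


def conditions_for_step(
    step: int, total_steps: int, depth_fraction: float = 0.5
) -> Tuple[str, ...]:
    """Return the conditioning signals used at one denoising step.

    Steps run from 0 (noisiest) to total_steps - 1 (cleanest).
    depth_fraction is an assumed hyperparameter: the share of early
    steps that still receive depth. It is not a value from the paper.
    """
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be in [0, 1]")
    switch_step = int(depth_fraction * total_steps)
    if step < switch_step:
        # Phase 1 (coarse): pose and depth anchor geometry and camera view.
        return ("pose", "depth")
    # Phase 2 (refinement): depth is dropped, pose alone guides generation.
    return ("pose",)


def build_schedule(
    total_steps: int, depth_fraction: float = 0.5
) -> List[Tuple[str, ...]]:
    """List the active conditions for every step of one sampling run."""
    return [
        conditions_for_step(step, total_steps, depth_fraction)
        for step in range(total_steps)
    ]


schedule = build_schedule(total_steps=10)
```

In a real sampler, the tuple for each step would decide which control inputs are passed to the denoiser at that step. The denoiser call is omitted here because it depends on the pretrained model being used.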

Results & Findings

Quantitative metrics (e.g., camera pose error, pose similarity): camera deviation drops by 30‑40% relative to pose‑only baselines.
Motion fidelity (measured by keypoint distance) improves modestly, indicating that the added camera conditioning does not harm the actor’s dynamics.
Human preference studies: participants chose ActCam videos over competing methods 65 % of the time, especially when the camera underwent large viewpoint swings.
The approach remains robust across diverse character styles (humans, animals, stylized avatars) and challenging scene layouts (indoor clutter, outdoor depth variations).
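
This summary names camera pose error and keypoint distance without defining them. As an illustration only, here are two common definitions in plain Python: the geodesic angle between a predicted and a reference camera rotation, and the mean Euclidean distance between matched 2D keypoints. Whether the paper uses these exact formulas is an assumption, and the function names and sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Point2 = Tuple[float, float]


def rotation_error_deg(r_pred: Matrix3, r_true: Matrix3) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_true) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def mean_keypoint_distance(pred: Sequence[Point2], true: Sequence[Point2]) -> float:
    """Mean Euclidean distance between matched keypoints, in pixels."""
    if len(pred) != len(true) or not pred:
        raise ValueError("keypoint lists must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, true)) / len(pred)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotation by 90 degrees about the z axis.
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

angle = rotation_error_deg(ROT_Z_90, IDENTITY)
kp_error = mean_keypoint_distance(
    [(0.0, 0.0), (3.0, 4.0)],
    [(0.0, 0.0), (0.0, 0.0)],
)
```

With these toy inputs, `angle` comes out at roughly 90 degrees and `kp_error` at 2.5 pixels. A full camera pose error would usually also include a translation term, which is left out here for brevity.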

Practical Implications

Content creation pipelines: filmmakers and game studios can script camera moves and reuse existing motion capture clips without re‑recording or retraining models.
AR/VR experiences: developers can generate on‑the‑fly cinematic replays that match a user’s head‑tracked viewpoint, keeping the virtual actor anchored correctly.
Rapid prototyping: product designers can visualize a prototype being used from any angle by simply providing a motion clip and a camera path, cutting down on costly motion‑capture sessions.
Education & training: instructors can demonstrate biomechanical motions from arbitrary viewpoints, aiding in sports analysis or medical training.

Limitations & Future Work

Depth sparsity: while sufficient for many scenes, extremely complex geometry may require denser depth cues or a dedicated depth predictor.
Dependence on pose extraction quality: noisy or missing keypoints can propagate errors into the generated video.
Real‑time generation: current diffusion sampling remains computationally heavy; accelerating inference (e.g., via distillation or latent‑space shortcuts) is an open avenue.
Broader modality support: extending the framework to handle audio‑driven lip sync or multi‑character interactions would further broaden its applicability.

ActCam demonstrates that careful conditioning and staged guidance can give fine‑grained, joint control over motion and cinematography without the cost of training new models, a step toward more flexible, creator‑friendly video synthesis.

Authors

Omar El Khalifi
Thomas Rossi
Oscar Fossey
Thibault Fouque
Ulysse Mizrahi
Philip Torr
Ivan Laptev
Fabio Pizzati
Baptiste Bellot-Gurlet

Paper Information

arXiv ID: 2605.06667v1
Categories: cs.CV, cs.AI, cs.LG
Published: May 7, 2026
