[Paper] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Published: March 5, 2026 at 01:59 PM EST
4 min read
Source: arXiv


Overview

FaceCam is a new system that lets you re‑animate a portrait video (e.g., a vlog or interview) as if it had been shot with a moving camera, even when the original footage was captured with a static, single‑lens camera. By introducing a scale‑aware conditioning scheme, the authors eliminate the geometric glitches that have plagued earlier video‑generation approaches, making the output look like it was really filmed with a professional dolly or crane.

Key Contributions

  • Scale‑aware camera representation: A deterministic way to encode camera motions that respects the subject’s size, removing the “scale ambiguity” that caused distortions in prior work.
  • Hybrid training data pipeline: Combines high‑quality multi‑view studio captures with everyday monocular videos, expanding the model’s ability to handle both controlled and wild settings.
  • Two novel data‑generation tricks:
    1. Synthetic camera motion – programmatically creates virtual camera paths for static videos.
    2. Multi‑shot stitching – stitches together short clips recorded from a fixed position to simulate continuous motion.
  • Video generation model tuned for portrait preservation: The network is explicitly regularized to keep the person’s identity, facial expressions, and subtle motions intact while the virtual camera moves.
  • Comprehensive evaluation: Demonstrates state‑of‑the‑art controllability and visual fidelity on the Ava‑256 benchmark and a wide range of in‑the‑wild portrait clips.
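To make the scale-aware idea concrete, here is a minimal sketch of how a per-frame scale factor could be derived from a detected face bounding box and packed with the camera pose into a single conditioning vector. The paper's exact encoding is not given in this summary; the function names, the 9-dimensional layout, and the `reference_ratio` normalization are all assumptions for illustration.

```python
import numpy as np

def scale_factor(face_bbox, frame_height, reference_ratio=0.25):
    """How large the face appears, relative to an assumed canonical size.

    face_bbox: (x0, y0, x1, y1) in pixels.
    reference_ratio: assumed canonical face-height / frame-height ratio.
    """
    face_height = face_bbox[3] - face_bbox[1]
    return (face_height / frame_height) / reference_ratio

def encode_camera(position, quaternion, focal_length, scale):
    """Concatenate pose, intrinsics, and subject scale into one vector."""
    return np.concatenate([
        np.asarray(position, dtype=np.float32),    # 3: camera position
        np.asarray(quaternion, dtype=np.float32),  # 4: orientation (w, x, y, z)
        [np.float32(focal_length)],                # 1: focal length
        [np.float32(scale)],                       # 1: subject scale
    ])

cond = encode_camera([0.0, 0.0, 2.0], [1.0, 0.0, 0.0, 0.0], 50.0,
                     scale_factor((400, 200, 600, 520), 1080))
print(cond.shape)  # (9,)
```

Conditioning on the explicit scale entry, rather than letting the generator infer subject size from pixels alone, is what removes the ambiguity the bullet above describes.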

Methodology

  1. Scale‑aware conditioning

    • The authors encode each desired camera pose (position, orientation, focal length) together with a scale factor derived from the subject’s face bounding box.
    • This factor tells the generator how large the person should appear at any point along the trajectory, preventing the “zoom‑in‑zoom‑out” artifacts that happen when the model guesses the scale from ambiguous cues.
  2. Training data preparation

    • Studio data: Multi‑camera rigs capture the same person from many angles, providing ground‑truth camera parameters.
    • Monocular data: Ordinary videos are augmented using the two tricks above, turning a static clip into a pseudo‑dynamic sequence with known virtual camera paths.
  3. Video generation backbone

    • A diffusion‑based video synthesis model (similar to Imagen Video) receives the concatenated frame embeddings and the scale‑aware pose conditioning at each timestep.
    • An identity‑preserving loss (based on a pre‑trained face recognition network) and a motion‑consistency loss keep the subject’s look and gestures stable across frames.
  4. Inference

    • Users supply a single portrait video and a desired camera trajectory (e.g., “slow push‑in from left to right”).
    • The system computes the scale‑aware conditioning on‑the‑fly and runs the diffusion model to output a smooth, controllable video.
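The inference steps above can be sketched end to end for a "slow push-in" trajectory. Everything below is an illustrative assumption, not the paper's code: the trajectory helper, the pinhole-model scale rule, and the 9-element conditioning layout are hypothetical, and the diffusion-model call itself is omitted.

```python
import numpy as np

def push_in_trajectory(n_frames, start_z=3.0, end_z=1.5):
    """Virtual dolly push-in: the camera moves toward the subject along z."""
    zs = np.linspace(start_z, end_z, n_frames)
    quat = np.array([1.0, 0.0, 0.0, 0.0])  # identity orientation (w, x, y, z)
    return [dict(position=np.array([0.0, 0.0, z]), quaternion=quat) for z in zs]

def expected_scale(z, reference_z=3.0):
    """Under a pinhole model, apparent subject size grows inversely with distance."""
    return reference_z / z

traj = push_in_trajectory(48)
conditioning = [
    np.concatenate([p["position"], p["quaternion"],
                    [50.0, expected_scale(p["position"][2])]])
    for p in traj
]
# Each per-frame vector would be fed to the diffusion model at every
# denoising timestep; the model call is omitted here.
print(len(conditioning), conditioning[0].shape)  # 48 (9,)
```

Because the scale entry is computed analytically from the trajectory, the generator is told exactly how large the subject should appear in every frame of the push-in, rather than having to guess it.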

Results & Findings

| Metric | Baseline (generic control) | FaceCam |
| --- | --- | --- |
| Camera controllability (PSNR, higher is better) | 22.1 dB | 27.8 dB |
| Visual quality (LPIPS, lower is better) | 0.31 | 0.18 |
| Identity preservation (ID-Score, higher is better) | 0.71 | 0.94 |
| Motion fidelity (FVD, lower is better) | 210 | 112 |
  • Qualitative: Side‑by‑side videos show that FaceCam maintains crisp facial details and natural background parallax, while baselines often produce stretched faces or wobbling backgrounds.
  • Generalization: The model works on both studio‑grade lighting and noisy, handheld phone footage, confirming that the synthetic motion and stitching tricks successfully bridge the domain gap.
  • User study: Over 80 % of participants preferred FaceCam’s outputs when asked to rank realism and identity consistency.

Practical Implications

  • Content creators: Turn a cheap selfie or a static interview into a dynamic shot without reshooting, saving time and production budget.
  • Virtual production: Generate camera moves for virtual avatars or digital twins when only a single video of the actor exists.
  • Live streaming & AR: Apply real‑time scale‑aware camera effects (e.g., virtual dolly‑in) to live portrait streams, enhancing viewer engagement.
  • Post‑production tools: Integrate FaceCam as a plug‑in for video editors (Premiere, DaVinci Resolve) to give editors fine‑grained control over camera paths on existing footage.

Limitations & Future Work

  • Depth ambiguity: While scale‑aware conditioning mitigates size errors, the system still lacks true 3‑D scene understanding, so extreme camera excursions (large parallax) can reveal background artifacts.
  • Computation cost: The diffusion backbone requires several seconds per frame on a high‑end GPU, limiting real‑time applications.
  • Occlusion handling: Heavy occlusions (e.g., hands covering the face) sometimes cause identity drift.
  • Future directions suggested by the authors include integrating lightweight depth estimation to improve background parallax, optimizing the diffusion pipeline for faster inference, and extending the framework to multi‑person group shots.

Authors

  • Weijie Lyu
  • Ming-Hsuan Yang
  • Zhixin Shu

Paper Information

  • arXiv ID: 2603.05506v1
  • Categories: cs.CV
  • Published: March 5, 2026