[Paper] Generative Video Motion Editing with 3D Point Tracks
Source: arXiv - 2512.02015v1
Overview
The paper introduces a track‑conditioned video‑to‑video (V2V) generation framework that lets users edit both camera and object motions in existing footage. By leveraging sparse 3D point tracks as a bridge between the source video and a desired motion trajectory, the system can re‑animate scenes with realistic depth handling, occlusion reasoning, and temporal coherence—something that prior image‑to‑video or V2V methods struggled to achieve.
Key Contributions
- 3D Point‑Track Conditioning: Uses paired 3D point tracks (source ↔ target) to inject explicit depth cues into the generation pipeline, enabling accurate motion transfer and occlusion handling.
- Joint Camera & Object Editing: Supports simultaneous manipulation of global camera motion and local object dynamics within a single model.
- Two‑Stage Training Regime: First pre‑trains on large synthetic datasets for robust geometry learning, then fine‑tunes on real video data to capture natural appearance variations.
- Versatile Motion Controls: Demonstrates motion transfer, non‑rigid deformation, and combined camera/object transformations with a single inference pass.
- Sparse Correspondence Transfer: Achieves high‑fidelity results while only requiring a modest number of 3D tracks, reducing annotation overhead compared with dense flow methods.
Methodology
1. Input Representation
- Source video $V_s$ (RGB frames).
- 3D point tracks $\{p_i^s(t)\}$ extracted from $V_s$ (e.g., via structure-from-motion or depth-aware trackers).
- Target tracks $\{p_i^t(t)\}$ that encode the desired motion (can be hand-crafted, transferred from another clip, or generated procedurally); see the data-layout sketch below.
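To make these inputs concrete, here is a minimal data-layout sketch in NumPy; the shapes, resolution, and visibility flags are illustrative assumptions, since the summary does not pin down an exact tensor format.

```python
import numpy as np

# Illustrative shapes only; the paper's exact representation is not specified here.
T, N = 48, 256  # number of frames, number of sparse 3D tracks

# Source video V_s: T RGB frames (values in [0, 1]).
video_src = np.zeros((T, 256, 448, 3), dtype=np.float32)

# Paired 3D point tracks: one camera-space (x, y, z) position per track per frame.
# tracks_src comes from the source footage (SfM / a depth-aware tracker);
# tracks_tgt encodes the desired motion (hand-crafted, transferred, or procedural).
tracks_src = np.zeros((T, N, 3), dtype=np.float32)
tracks_tgt = np.zeros((T, N, 3), dtype=np.float32)

# Optional per-track visibility flags, useful when a tracker loses a point.
visible = np.ones((T, N), dtype=bool)
```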
2. Track-Conditioned Generator
- A spatio-temporal UNet processes each frame while receiving a track embedding that encodes the relative 3D displacement $\Delta p_i(t) = p_i^t(t) - p_i^s(t)$.
- The embedding is broadcast spatially, allowing the network to modulate pixel-level synthesis based on depth-aware motion cues (see the conditioning sketch below).
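The sketch below shows one plausible way to form such a conditioning signal: each track's relative 3D displacement (plus its source depth) is embedded by a small MLP and scattered onto a dense per-frame feature map at the track's projected pixel location. The module name, layer sizes, and the scatter-at-track-location rule are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TrackConditioner(nn.Module):
    """Sketch of a track-conditioning module (illustrative, not the paper's design)."""
    def __init__(self, dim=64):
        super().__init__()
        # Embed each track's 3D displacement plus its source depth into a feature vector.
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, p_src, p_tgt, uv, hw):
        # p_src, p_tgt: (N, 3) 3D track positions for one frame.
        # uv: (N, 2) integer pixel coordinates where the source tracks project.
        # hw: (H, W) spatial size of the UNet feature map being conditioned.
        H, W = hw
        delta = p_tgt - p_src                                    # (N, 3) relative displacement
        feats = self.mlp(torch.cat([delta, p_src[:, 2:3]], -1))  # (N, dim), depth kept as a cue
        # Scatter the sparse track features onto a dense map that is later
        # combined with the UNet's per-frame features.
        cond = torch.zeros(feats.shape[-1], H, W)
        cond[:, uv[:, 1].clamp(0, H - 1), uv[:, 0].clamp(0, W - 1)] = feats.T
        return cond  # (dim, H, W) conditioning map for this frame
```

In practice such a map would be computed per frame, stacked along time, and concatenated with (or added to) the spatio-temporal UNet's features.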
3. Depth-Aware Occlusion Handling
- Because tracks live in 3D, the model can infer depth ordering: points that move behind others trigger appropriate occlusion masks, preventing ghosting artifacts common in 2D‑track methods.
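A toy z-buffer over the target tracks makes the depth-ordering idea concrete: when two tracks land in the same coarse image cell after the motion edit, the one farther from the camera is flagged as occluded. The cell size and camera-space convention are arbitrary choices; this stands in for, and does not reproduce, the occlusion reasoning the trained model performs.

```python
import numpy as np

def occlusion_mask(points_3d, uv, grid_hw, cell=8):
    """Toy z-buffer over target tracks: a track is occluded if another track
    in the same coarse cell is nearer to the camera.
    points_3d: (N, 3) camera-space positions, z increasing with distance.
    uv: (N, 2) integer pixel coordinates after applying the target motion.
    grid_hw: (H, W) image size in pixels."""
    H, W = grid_hw
    cells = (uv[:, 1] // cell) * (W // cell) + (uv[:, 0] // cell)
    occluded = np.zeros(len(points_3d), dtype=bool)
    for c in np.unique(cells):
        idx = np.where(cells == c)[0]
        nearest = idx[np.argmin(points_3d[idx, 2])]  # smallest z wins the cell
        occluded[idx] = True
        occluded[nearest] = False
    return occluded
```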
4. Training Pipeline
- Stage 1 (Synthetic): Rendered scenes with known geometry and motion provide ground‑truth 3D tracks, letting the network learn to respect depth and motion consistency.
- Stage 2 (Real): The model is fine-tuned on real video clips whose 3D tracks are estimated (e.g., COLMAP + optical flow), guided by a self-supervised reconstruction loss combined with an adversarial video-realism loss (a loss sketch follows below).
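As a rough picture of how the Stage-2 objective could be assembled, the helper below adds an L1 reconstruction term to a non-saturating adversarial term; the weighting and the exact adversarial formulation are assumptions, since only the two loss families are named.

```python
import torch.nn.functional as F

def stage2_loss(pred_video, ref_video, disc_logits_fake, adv_weight=0.1):
    """Illustrative Stage-2 objective (assumed formulation, not taken from the paper).
    pred_video, ref_video: (B, T, C, H, W) generated and reference clips.
    disc_logits_fake: discriminator scores on the generated clip."""
    recon = F.l1_loss(pred_video, ref_video)   # self-supervised reconstruction term
    adv = -disc_logits_fake.mean()             # non-saturating generator realism term
    return recon + adv_weight * adv
```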
5. Inference
- Users supply a source clip and a set of target 3D tracks (or a motion‑transfer source). The generator outputs a new video that follows the prescribed motion while preserving the original scene’s look and feel.
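As a concrete (and purely illustrative) example of authoring target tracks procedurally, the snippet below rotates the source tracks about the scene centroid a little more each frame, which reads as the camera slowly orbiting the scene while object motion is preserved; the transform and conventions are not taken from the paper.

```python
import numpy as np

def orbit_target_tracks(tracks_src, degrees_per_frame=0.5):
    """Toy target-motion edit: rotate the scene about its centroid a bit more
    each frame so the result looks like a slow camera orbit (illustrative only).
    tracks_src: (T, N, 3) camera-space source tracks."""
    T = tracks_src.shape[0]
    center = tracks_src.reshape(-1, 3).mean(axis=0)   # pivot around the scene centroid
    out = np.empty_like(tracks_src)
    for t in range(T):
        a = np.deg2rad(degrees_per_frame * t)
        R = np.array([[ np.cos(a), 0.0, np.sin(a)],   # rotation about the vertical (y) axis
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
        out[t] = (tracks_src[t] - center) @ R.T + center
    return out

# Example: 48 frames, 256 tracks; the pair (tracks_src, tracks_tgt) conditions the generator.
tracks_src = np.random.rand(48, 256, 3).astype(np.float32)
tracks_tgt = orbit_target_tracks(tracks_src)
```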
Results & Findings
| Experiment | Metric | Outcome |
|---|---|---|
| Motion Transfer Accuracy (3D-track vs. 2D-track) | PSNR / SSIM (higher is better) | +2.8 dB PSNR, +0.07 SSIM improvement with 3D tracks |
| Occlusion Consistency (temporal flicker) | Temporal Warping Error (lower is better) | 35% reduction vs. baseline V2V |
| User Study (realism & control) | Preference Rate (higher is better) | 78% of participants preferred the 3D-track system for fine-grained edits |
| Ablation (no depth cue) | Visual Artifacts (fewer is better) | Noticeable depth-ordering errors and ghosting in 30% of frames |
The authors demonstrate a range of edits: rotating the camera around a moving car while preserving the car’s trajectory, transferring a dancer’s motion onto a different performer, and applying non‑rigid deformations (e.g., stretching a flag) without breaking scene coherence.
Practical Implications
- Post‑Production & VFX: Editors can now retarget camera moves or object actions without re‑shooting or manually rotoscoping, dramatically cutting down on labor‑intensive compositing.
- AR/VR Content Creation: Developers can generate immersive video assets that adapt to user‑driven camera paths, thanks to the depth‑aware motion control.
- Game Asset Pipeline: Motion capture data can be transferred onto existing video footage to prototype cinematic cut‑scenes quickly.
- Automated Video Personalization: Brands could automatically re‑orient product videos (e.g., rotating a smartphone) to match different ad formats while preserving realistic lighting and occlusions.
- Open‑Source Tooling: Because the method relies on sparse 3D tracks—obtainable via off‑the‑shelf SfM libraries—the approach can be integrated into existing video editing suites with modest engineering effort.
Limitations & Future Work
- Track Acquisition Overhead: While sparse, generating accurate 3D tracks still requires reliable structure‑from‑motion pipelines; failure cases (low texture, fast motion) can degrade results.
- Complex Non‑Rigid Motions: Extremely high‑frequency deformations (e.g., water splashes) remain challenging due to the limited granularity of sparse tracks.
- Scalability to Long Clips: Temporal memory is bounded; very long sequences may need chunked processing, potentially introducing seams.
- Future Directions: The authors suggest exploring learned track inference (jointly estimating 3D tracks and video synthesis), interactive UI tools for on‑the‑fly track editing, and extending the framework to multi‑camera setups for stereoscopic or 360° content.
Authors
- Yao-Chih Lee
- Zhoutong Zhang
- Jiahui Huang
- Jui-Hsien Wang
- Joon-Young Lee
- Jia-Bin Huang
- Eli Shechtman
- Zhengqi Li
Paper Information
- arXiv ID: 2512.02015v1
- Categories: cs.CV
- Published: December 1, 2025