[Paper] Generative Video Motion Editing with 3D Point Tracks
Source: arXiv - 2512.02015v1
Overview
The paper introduces a track‑conditioned video‑to‑video (V2V) generation framework that lets users edit both camera and object motions in existing footage. By leveraging sparse 3D point tracks as a bridge between the source video and a desired motion trajectory, the system can re‑animate scenes with realistic depth handling, occlusion reasoning, and temporal coherence—something that prior image‑to‑video or V2V methods struggled to achieve.
Key Contributions
- 3D Point‑Track Conditioning: Uses paired 3D point tracks (source ↔ target) to inject explicit depth cues into the generation pipeline, enabling accurate motion transfer and occlusion handling.
- Joint Camera & Object Editing: Supports simultaneous manipulation of global camera motion and local object dynamics within a single model.
- Two‑Stage Training Regime: First pre‑trains on large synthetic datasets for robust geometry learning, then fine‑tunes on real video data to capture natural appearance variations.
- Versatile Motion Controls: Demonstrates motion transfer, non‑rigid deformation, and combined camera/object transformations with a single inference pass.
- Sparse Correspondence Transfer: Achieves high‑fidelity results while only requiring a modest number of 3D tracks, reducing annotation overhead compared with dense flow methods.
Methodology
1. Input Representation
- Source video $V_s$ (RGB frames).
- 3D point tracks $\{p_i^s(t)\}$ extracted from $V_s$ (e.g., via structure-from-motion or depth-aware trackers).
- Target tracks $\{p_i^t(t)\}$ that encode the desired motion (can be hand-crafted, transferred from another clip, or generated procedurally); see the data-layout sketch below.
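To make these inputs concrete, here is a minimal data-layout sketch in NumPy; the shapes, resolution, and visibility flags are illustrative assumptions, since the summary does not pin down an exact tensor format.

```python
import numpy as np

# Illustrative shapes only; the paper's exact representation is not specified here.
T, N = 48, 256  # number of frames, number of sparse 3D tracks

# Source video V_s: T RGB frames (values in [0, 1]).
video_src = np.zeros((T, 256, 448, 3), dtype=np.float32)

# Paired 3D point tracks: one camera-space (x, y, z) position per track per frame.
# tracks_src comes from the source footage (SfM / a depth-aware tracker);
# tracks_tgt encodes the desired motion (hand-crafted, transferred, or procedural).
tracks_src = np.zeros((T, N, 3), dtype=np.float32)
tracks_tgt = np.zeros((T, N, 3), dtype=np.float32)

# Optional per-track visibility flags, useful when a tracker loses a point.
visible = np.ones((T, N), dtype=bool)
```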
2. Track-Conditioned Generator
- A spatio-temporal UNet processes each frame while receiving a track embedding that encodes the relative 3D displacement $\Delta p_i(t) = p_i^t(t) - p_i^s(t)$.
- The embedding is broadcast spatially, allowing the network to modulate pixel-level synthesis based on depth-aware motion cues (see the conditioning sketch below).
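The sketch below shows one plausible way to form such a conditioning signal: each track's relative 3D displacement (plus its source depth) is embedded by a small MLP and scattered onto a dense per-frame feature map at the track's projected pixel location. The module name, layer sizes, and the scatter-at-track-location rule are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TrackConditioner(nn.Module):
    """Sketch of a track-conditioning module (illustrative, not the paper's design)."""
    def __init__(self, dim=64):
        super().__init__()
        # Embed each track's 3D displacement plus its source depth into a feature vector.
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, p_src, p_tgt, uv, hw):
        # p_src, p_tgt: (N, 3) 3D track positions for one frame.
        # uv: (N, 2) integer pixel coordinates where the source tracks project.
        # hw: (H, W) spatial size of the UNet feature map being conditioned.
        H, W = hw
        delta = p_tgt - p_src                                    # (N, 3) relative displacement
        feats = self.mlp(torch.cat([delta, p_src[:, 2:3]], -1))  # (N, dim), depth kept as a cue
        # Scatter the sparse track features onto a dense map that is later
        # combined with the UNet's per-frame features.
        cond = torch.zeros(feats.shape[-1], H, W)
        cond[:, uv[:, 1].clamp(0, H - 1), uv[:, 0].clamp(0, W - 1)] = feats.T
        return cond  # (dim, H, W) conditioning map for this frame
```

In practice such a map would be computed per frame, stacked along time, and concatenated with (or added to) the spatio-temporal UNet's features.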
3. Depth-Aware Occlusion Handling
- Because tracks live in 3D, the model can infer depth ordering: points that move behind others trigger appropriate occlusion masks, preventing ghosting artifacts common in 2D‑track methods.
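A toy z-buffer over the target tracks makes the depth-ordering idea concrete: when two tracks land in the same coarse image cell after the motion edit, the one farther from the camera is flagged as occluded. The cell size and camera-space convention are arbitrary choices; this stands in for, and does not reproduce, the occlusion reasoning the trained model performs.

```python
import numpy as np

def occlusion_mask(points_3d, uv, grid_hw, cell=8):
    """Toy z-buffer over target tracks: a track is occluded if another track
    in the same coarse cell is nearer to the camera.
    points_3d: (N, 3) camera-space positions, z increasing with distance.
    uv: (N, 2) integer pixel coordinates after applying the target motion.
    grid_hw: (H, W) image size in pixels."""
    H, W = grid_hw
    cells = (uv[:, 1] // cell) * (W // cell) + (uv[:, 0] // cell)
    occluded = np.zeros(len(points_3d), dtype=bool)
    for c in np.unique(cells):
        idx = np.where(cells == c)[0]
        nearest = idx[np.argmin(points_3d[idx, 2])]  # smallest z wins the cell
        occluded[idx] = True
        occluded[nearest] = False
    return occluded
```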
4. Training Pipeline
- Stage 1 (Synthetic): Rendered scenes with known geometry and motion provide ground‑truth 3D tracks, letting the network learn to respect depth and motion consistency.
- Stage 2 (Real): The model is fine-tuned on real video clips whose 3D tracks are estimated (e.g., COLMAP + optical flow), guided by a self-supervised reconstruction loss combined with an adversarial video-realism loss (a loss sketch follows below).
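As a rough picture of how the Stage-2 objective could be assembled, the helper below adds an L1 reconstruction term to a non-saturating adversarial term; the weighting and the exact adversarial formulation are assumptions, since only the two loss families are named.

```python
import torch.nn.functional as F

def stage2_loss(pred_video, ref_video, disc_logits_fake, adv_weight=0.1):
    """Illustrative Stage-2 objective (assumed formulation, not taken from the paper).
    pred_video, ref_video: (B, T, C, H, W) generated and reference clips.
    disc_logits_fake: discriminator scores on the generated clip."""
    recon = F.l1_loss(pred_video, ref_video)   # self-supervised reconstruction term
    adv = -disc_logits_fake.mean()             # non-saturating generator realism term
    return recon + adv_weight * adv
```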
5. Inference
- Users supply a source clip and a set of target 3D tracks (or a motion‑transfer source). The generator outputs a new video that follows the prescribed motion while preserving the original scene’s look and feel.
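As a concrete (and purely illustrative) example of authoring target tracks procedurally, the snippet below rotates the source tracks about the scene centroid a little more each frame, which reads as the camera slowly orbiting the scene while object motion is preserved; the transform and conventions are not taken from the paper.

```python
import numpy as np

def orbit_target_tracks(tracks_src, degrees_per_frame=0.5):
    """Toy target-motion edit: rotate the scene about its centroid a bit more
    each frame so the result looks like a slow camera orbit (illustrative only).
    tracks_src: (T, N, 3) camera-space source tracks."""
    T = tracks_src.shape[0]
    center = tracks_src.reshape(-1, 3).mean(axis=0)   # pivot around the scene centroid
    out = np.empty_like(tracks_src)
    for t in range(T):
        a = np.deg2rad(degrees_per_frame * t)
        R = np.array([[ np.cos(a), 0.0, np.sin(a)],   # rotation about the vertical (y) axis
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
        out[t] = (tracks_src[t] - center) @ R.T + center
    return out

# Example: 48 frames, 256 tracks; the pair (tracks_src, tracks_tgt) conditions the generator.
tracks_src = np.random.rand(48, 256, 3).astype(np.float32)
tracks_tgt = orbit_target_tracks(tracks_src)
```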
Results & Findings
| Experiment | Metric | Outcome |
|---|---|---|
| Motion Transfer Accuracy (3D-track vs. 2D-track) | PSNR / SSIM (higher is better) | +2.8 dB PSNR, +0.07 SSIM improvement with 3D tracks |
| Occlusion Consistency (temporal flicker) | Temporal Warping Error (lower is better) | 35% reduction vs. baseline V2V |
| User Study (realism & control) | Preference Rate (higher is better) | 78% of participants preferred the 3D-track system for fine-grained edits |
| Ablation (no depth cue) | Visual Artifacts (fewer is better) | Noticeable depth-ordering errors and ghosting in 30% of frames |
The authors demonstrate a range of edits: rotating the camera around a moving car while preserving the car’s trajectory, transferring a dancer’s motion onto a different performer, and applying non‑rigid deformations (e.g., stretching a flag) without breaking scene coherence.
Practical Implications
- Post‑Production & VFX: Editors can now retarget camera moves or object actions without re‑shooting or manually rotoscoping, dramatically cutting down on labor‑intensive compositing.
- AR/VR Content Creation: Developers can generate immersive video assets that adapt to user‑driven camera paths, thanks to the depth‑aware motion control.
- Game Asset Pipeline: Motion capture data can be transferred onto existing video footage to prototype cinematic cut‑scenes quickly.
- Automated Video Personalization: Brands could automatically re‑orient product videos (e.g., rotating a smartphone) to match different ad formats while preserving realistic lighting and occlusions.
- Open‑Source Tooling: Because the method relies on sparse 3D tracks—obtainable via off‑the‑shelf SfM libraries—the approach can be integrated into existing video editing suites with modest engineering effort.
Limitations & Future Work
- Track Acquisition Overhead: While sparse, generating accurate 3D tracks still requires reliable structure‑from‑motion pipelines; failure cases (low texture, fast motion) can degrade results.
- Complex Non‑Rigid Motions: Extremely high‑frequency deformations (e.g., water splashes) remain challenging due to the limited granularity of sparse tracks.
- Scalability to Long Clips: Temporal memory is bounded; very long sequences may need chunked processing, potentially introducing seams.
- Future Directions: The authors suggest exploring learned track inference (jointly estimating 3D tracks and video synthesis), interactive UI tools for on‑the‑fly track editing, and extending the framework to multi‑camera setups for stereoscopic or 360° content.
Authors
- Yao-Chih Lee
- Zhoutong Zhang
- Jiahui Huang
- Jui-Hsien Wang
- Joon-Young Lee
- Jia-Bin Huang
- Eli Shechtman
- Zhengqi Li
Paper Information
- arXiv ID: 2512.02015v1
- Categories: cs.CV
- Published: December 1, 2025