[Paper] VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
Source: arXiv - 2601.23286v1
Overview
The paper VideoGPA: Distilling Geometry Priors for 3D‑Consistent Video Generation tackles a persistent problem in video diffusion models: while they can create striking frames, the resulting videos often suffer from jittery objects, warped shapes, or drifting perspectives. By injecting geometric knowledge from a dedicated “geometry foundation model” into the diffusion process, the authors show how to coax the generator toward naturally coherent 3‑D structure—without any hand‑crafted labels.
Key Contributions
- Geometry‑driven preference signals: Introduces a self‑supervised pipeline that extracts dense, frame‑level geometry cues (depth, surface normals, etc.) from a pretrained geometry model and turns them into preference pairs for the video diffusion model.
- Direct Preference Optimization (DPO) for videos: Adapts the recent DPO technique—originally used for language models—to steer video diffusion training using the geometry‑derived preferences.
- Data‑efficient learning: Demonstrates that only a few thousand preference pairs are enough to achieve noticeable 3‑D consistency, dramatically reducing the annotation burden.
- Comprehensive evaluation: Shows consistent gains over state‑of‑the‑art video diffusion baselines across temporal stability, physical plausibility, and motion coherence metrics.
- Open‑source implementation: Releases code, pretrained checkpoints, and a lightweight inference script, making it easy for practitioners to plug the method into existing pipelines.
Methodology
- Geometry foundation model: The authors start with a pretrained model that predicts dense depth and surface‑normal maps from single images (e.g., MiDaS or a recent multi‑task vision transformer).
- Preference pair generation: For a given video prompt, the diffusion model samples two candidate videos. The geometry model then scores each candidate by how consistently its depth and normal fields align across consecutive frames. The higher‑scoring video becomes the "preferred" sample; the lower‑scoring one becomes the "non‑preferred" sample.
- Direct Preference Optimization: Using the (preferred, non‑preferred) pairs, DPO updates the diffusion model’s parameters to increase the likelihood of the preferred video while decreasing that of the non‑preferred one. This is done via a simple binary cross‑entropy loss on the model’s log‑probabilities, avoiding the need for reinforcement‑learning tricks.
- Training loop: The process repeats across many prompts, but because the geometry model provides dense supervision automatically, the overall training cost stays modest. At inference time only the diffusion model is needed; the geometry model is discarded entirely.
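The preference‑pair step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_depth` is a stub standing in for the pretrained geometry foundation model, and the luminance‑as‑depth proxy exists only so the example runs end to end.

```python
# Hypothetical sketch of geometry-based preference-pair generation.
# `predict_depth` is a stub for the pretrained geometry foundation
# model (e.g., a MiDaS-style depth predictor); a real pipeline would
# call that network here.
import numpy as np

def predict_depth(frame: np.ndarray) -> np.ndarray:
    """Stub geometry model: returns a per-pixel 'depth' map.
    Toy proxy (frame luminance) so the sketch is self-contained."""
    return frame.mean(axis=-1)

def temporal_geometry_score(video: np.ndarray) -> float:
    """Score temporal consistency of depth across frames.
    Higher is better (less frame-to-frame depth drift)."""
    depths = np.stack([predict_depth(f) for f in video])  # (T, H, W)
    drift = np.abs(np.diff(depths, axis=0)).mean()        # mean |d_{t+1} - d_t|
    return -float(drift)  # negate: lower drift -> higher score

def make_preference_pair(video_a, video_b):
    """Return (preferred, non_preferred) by geometry score."""
    sa = temporal_geometry_score(video_a)
    sb = temporal_geometry_score(video_b)
    return (video_a, video_b) if sa >= sb else (video_b, video_a)

# Two tiny synthetic candidates: one static, one jittery.
rng = np.random.default_rng(0)
base = rng.random((8, 8, 3))
stable = np.stack([base] * 4)                                    # no drift
jittery = np.stack([base + 0.3 * rng.random((8, 8, 3)) for _ in range(4)])
preferred, rejected = make_preference_pair(stable, jittery)
```

The key property is that the scoring function is fully automatic: no human labels enter the loop, so preference pairs can be generated at whatever scale the geometry model's inference budget allows.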
Results & Findings
- Temporal stability: VideoGPA reduces frame‑to‑frame pixel drift by ~30% compared to the baseline video diffusion model, as measured by Learned Perceptual Image Patch Similarity (LPIPS) across consecutive frames.
- Physical plausibility: Depth consistency scores improve by 0.12 on average, indicating that objects maintain realistic shape and scale throughout motion.
- Motion coherence: Optical‑flow‑based metrics (e.g., End‑Point Error) show a 15% drop, meaning generated motion aligns better with the underlying 3‑D scene.
- Human evaluation: In a blind study with 200 participants, 68 % preferred videos generated with VideoGPA over the strongest competing method, citing “less wobbling” and “more believable depth.”
- Efficiency: The model achieves these gains with only ~5k preference pairs, a fraction of the data required by prior self‑supervised consistency methods.
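The temporal‑stability measurement can be approximated with a simple script. Real LPIPS requires a pretrained perceptual network (e.g., the `lpips` Python package); the stand‑in below uses plain mean absolute pixel difference between consecutive frames so the sketch stays self‑contained, and is only a rough proxy for the perceptual metric the paper reports.

```python
# Simplified frame-to-frame drift metric in the spirit of the paper's
# consecutive-frame LPIPS measurement (pixel-space proxy, not LPIPS).
import numpy as np

def temporal_drift(video: np.ndarray) -> float:
    """Mean per-pixel change between consecutive frames.
    Lower values indicate a temporally stabler video."""
    frames = video.astype(np.float64)
    return float(np.abs(np.diff(frames, axis=0)).mean())

rng = np.random.default_rng(1)
base = rng.random((16, 16, 3))
stable = np.stack([base] * 5)                                     # zero drift
noisy = np.stack([base + 0.2 * rng.random((16, 16, 3)) for _ in range(5)])
```

Swapping the pixel difference for a perceptual distance (LPIPS between frames t and t+1, averaged over t) recovers the style of metric quoted above.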
Practical Implications
- Content creation pipelines: Studios and indie developers can generate longer, more stable video assets (e.g., background loops, product demos) without manually retouching frame‑by‑frame.
- AR/VR and game prototyping: Real‑time video generation for immersive experiences can now preserve spatial coherence, reducing the need for separate geometry pipelines.
- Synthetic data for training: Autonomous‑driving or robotics simulators that rely on synthetic video can benefit from more physically plausible scenes, improving downstream model robustness.
- Plug‑and‑play upgrade: Because the geometry model is only needed during training, existing diffusion‑based video generators can be upgraded with a single fine‑tuning step, keeping inference latency unchanged.
Limitations & Future Work
- Geometry model bias: The approach inherits any systematic errors of the underlying depth/normal predictor (e.g., failure on reflective surfaces).
- Scalability to high‑resolution video: Preference generation currently operates at 256 × 256; extending to 4K video may require more efficient geometry inference or hierarchical training.
- Complex motion patterns: Extremely fast or non‑rigid deformations (e.g., fluid dynamics) still challenge the current preference signal, suggesting a need for richer physical priors.
- Future directions: The authors plan to explore multi‑modal geometry cues (e.g., surface reflectance), integrate learned camera pose estimation, and test the framework on text‑to‑video models with larger latent spaces.
Authors
- Hongyang Du
- Junjie Ye
- Xiaoyan Cong
- Runhao Li
- Jingcheng Ni
- Aman Agarwal
- Zeqi Zhou
- Zekun Li
- Randall Balestriero
- Yue Wang
Paper Information
- arXiv ID: 2601.23286v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: January 30, 2026