[Paper] VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
Source: arXiv - 2601.23286v1
Overview
The paper VideoGPA: Distilling Geometry Priors for 3D‑Consistent Video Generation tackles a persistent problem in video diffusion models: while they can create striking frames, the resulting videos often suffer from jittery objects, warped shapes, or drifting perspectives. By injecting geometric knowledge from a dedicated “geometry foundation model” into the diffusion process, the authors show how to coax the generator toward naturally coherent 3‑D structure—without any hand‑crafted labels.
Key Contributions
- Geometry‑driven preference signals: Introduces a self‑supervised pipeline that extracts dense, frame‑level geometry cues (depth, surface normals, etc.) from a pretrained geometry model and turns them into preference pairs for the video diffusion model.
- Direct Preference Optimization (DPO) for videos: Adapts the recent DPO technique—originally used for language models—to steer video diffusion training using the geometry‑derived preferences.
- Data‑efficient learning: Demonstrates that only a few thousand preference pairs are enough to achieve noticeable 3‑D consistency, dramatically reducing the annotation burden.
- Comprehensive evaluation: Shows consistent gains over state‑of‑the‑art video diffusion baselines across temporal stability, physical plausibility, and motion coherence metrics.
- Open‑source implementation: Releases code, pretrained checkpoints, and a lightweight inference script, making it easy for practitioners to plug the method into existing pipelines.
Methodology
- Geometry foundation model: The authors start with a pretrained model that predicts dense depth and surface‑normal maps from single images (e.g., MiDaS or a recent multi‑task vision transformer).
- Preference pair generation: For a given video prompt, the diffusion model samples two candidate videos. The geometry model then scores each candidate by how consistently its depth and normal fields align across consecutive frames. The higher‑scoring video becomes the "preferred" sample; the lower‑scoring one becomes the "non‑preferred" sample.
- Direct Preference Optimization: Using the (preferred, non‑preferred) pairs, DPO updates the diffusion model’s parameters to increase the likelihood of the preferred video while decreasing that of the non‑preferred one. This is done via a simple binary cross‑entropy loss on the model’s log‑probabilities, avoiding the need for reinforcement‑learning tricks.
- Training loop: The process repeats across many prompts, but because the geometry model provides dense supervision automatically, the overall training cost stays modest. At inference time only the diffusion model is needed; the geometry model is discarded entirely.
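The preference‑pair step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_depth` is a stub standing in for the pretrained geometry foundation model, and the luminance‑as‑depth proxy exists only so the example runs end to end.

```python
# Hypothetical sketch of geometry-based preference-pair generation.
# `predict_depth` is a stub for the pretrained geometry foundation
# model (e.g., a MiDaS-style depth predictor); a real pipeline would
# call that network here.
import numpy as np

def predict_depth(frame: np.ndarray) -> np.ndarray:
    """Stub geometry model: returns a per-pixel 'depth' map.
    Toy proxy (frame luminance) so the sketch is self-contained."""
    return frame.mean(axis=-1)

def temporal_geometry_score(video: np.ndarray) -> float:
    """Score temporal consistency of depth across frames.
    Higher is better (less frame-to-frame depth drift)."""
    depths = np.stack([predict_depth(f) for f in video])  # (T, H, W)
    drift = np.abs(np.diff(depths, axis=0)).mean()        # mean |d_{t+1} - d_t|
    return -float(drift)  # negate: lower drift -> higher score

def make_preference_pair(video_a, video_b):
    """Return (preferred, non_preferred) by geometry score."""
    sa = temporal_geometry_score(video_a)
    sb = temporal_geometry_score(video_b)
    return (video_a, video_b) if sa >= sb else (video_b, video_a)

# Two tiny synthetic candidates: one static, one jittery.
rng = np.random.default_rng(0)
base = rng.random((8, 8, 3))
stable = np.stack([base] * 4)                                    # no drift
jittery = np.stack([base + 0.3 * rng.random((8, 8, 3)) for _ in range(4)])
preferred, rejected = make_preference_pair(stable, jittery)
```

The key property is that the scoring function is fully automatic: no human labels enter the loop, so preference pairs can be generated at whatever scale the geometry model's inference budget allows.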
Results & Findings
- Temporal stability: VideoGPA reduces frame‑to‑frame pixel drift by ~30% compared to the baseline video diffusion model, as measured by Learned Perceptual Image Patch Similarity (LPIPS) across consecutive frames.
- Physical plausibility: Depth consistency scores improve by 0.12 on average, indicating that objects maintain realistic shape and scale throughout motion.
- Motion coherence: Optical‑flow‑based metrics (e.g., End‑Point Error) show a 15% drop, meaning generated motion aligns better with the underlying 3‑D scene.
- Human evaluation: In a blind study with 200 participants, 68 % preferred videos generated with VideoGPA over the strongest competing method, citing “less wobbling” and “more believable depth.”
- Efficiency: The model achieves these gains with only ~5k preference pairs, a fraction of the data required by prior self‑supervised consistency methods.
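The temporal‑stability measurement can be approximated with a simple script. Real LPIPS requires a pretrained perceptual network (e.g., the `lpips` Python package); the stand‑in below uses plain mean absolute pixel difference between consecutive frames so the sketch stays self‑contained, and is only a rough proxy for the perceptual metric the paper reports.

```python
# Simplified frame-to-frame drift metric in the spirit of the paper's
# consecutive-frame LPIPS measurement (pixel-space proxy, not LPIPS).
import numpy as np

def temporal_drift(video: np.ndarray) -> float:
    """Mean per-pixel change between consecutive frames.
    Lower values indicate a temporally stabler video."""
    frames = video.astype(np.float64)
    return float(np.abs(np.diff(frames, axis=0)).mean())

rng = np.random.default_rng(1)
base = rng.random((16, 16, 3))
stable = np.stack([base] * 5)                                     # zero drift
noisy = np.stack([base + 0.2 * rng.random((16, 16, 3)) for _ in range(5)])
```

Swapping the pixel difference for a perceptual distance (LPIPS between frames t and t+1, averaged over t) recovers the style of metric quoted above.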
Practical Implications
- Content creation pipelines: Studios and indie developers can generate longer, more stable video assets (e.g., background loops, product demos) without manually retouching frame‑by‑frame.
- AR/VR and game prototyping: Real‑time video generation for immersive experiences can now preserve spatial coherence, reducing the need for separate geometry pipelines.
- Synthetic data for training: Autonomous‑driving or robotics simulators that rely on synthetic video can benefit from more physically plausible scenes, improving downstream model robustness.
- Plug‑and‑play upgrade: Because the geometry model is only needed during training, existing diffusion‑based video generators can be upgraded with a single fine‑tuning step, keeping inference latency unchanged.
Limitations & Future Work
- Geometry model bias: The approach inherits any systematic errors of the underlying depth/normal predictor (e.g., failure on reflective surfaces).
- Scalability to high‑resolution video: Preference generation currently operates at 256 × 256; extending to 4K video may require more efficient geometry inference or hierarchical training.
- Complex motion patterns: Extremely fast or non‑rigid deformations (e.g., fluid dynamics) still challenge the current preference signal, suggesting a need for richer physical priors.
- Future directions: The authors plan to explore multi‑modal geometry cues (e.g., surface reflectance), integrate learned camera pose estimation, and test the framework on text‑to‑video models with larger latent spaces.
Authors
- Hongyang Du
- Junjie Ye
- Xiaoyan Cong
- Runhao Li
- Jingcheng Ni
- Aman Agarwal
- Zeqi Zhou
- Zekun Li
- Randall Balestriero
- Yue Wang
Paper Information
- arXiv ID: 2601.23286v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: January 30, 2026