[Paper] DVGT: Driving Visual Geometry Transformer
Source: arXiv - 2512.16919v1
Overview
The Driving Visual Geometry Transformer (DVGT) tackles a core challenge for autonomous vehicles: turning raw camera streams into a dense, metric‑scale 3D point cloud of the surrounding environment. By leveraging a transformer architecture that jointly reasons over space, view, and time, DVGT can reconstruct global geometry from arbitrary multi‑camera rigs without needing calibrated intrinsics or extrinsics. Trained on a massive mix of public driving datasets, it sets a new performance bar for vision‑only 3D perception.
Key Contributions
- Vision‑only dense geometry estimator that works with any number of cameras and does not require explicit camera calibration.
- Hybrid attention scheme: intra‑view local attention → cross‑view spatial attention → cross‑frame temporal attention, enabling the model to fuse information across pixels, viewpoints, and time steps.
- Dual‑head decoder that simultaneously outputs (1) a global ego‑centric point cloud and (2) per‑frame ego poses, removing the need for downstream SLAM or GPS alignment.
- Large‑scale multi‑dataset training (nuScenes, Waymo, KITTI, OpenScene, DDAD) demonstrating strong generalisation across cities, weather, and sensor setups.
- Open‑source implementation (code & pretrained weights) to accelerate research and industry adoption.
Methodology
- Feature Extraction – Each input image is passed through a DINO‑pretrained Vision Transformer (ViT) backbone, producing high‑level visual tokens.
- Alternating Attention Blocks –
- Intra‑view local attention captures fine‑grained geometry within a single camera frame (e.g., edges, textures).
- Cross‑view spatial attention lets tokens from different cameras attend to each other, learning correspondences across overlapping fields of view.
- Cross‑frame temporal attention propagates information forward and backward in time, stabilising depth estimates and handling occlusions.
These blocks are stacked repeatedly, allowing the network to iteratively refine a unified 3D representation (see the sketch after this list).
- Multi‑head Decoding –
- Point‑map head regresses 3D coordinates (in the ego frame of the first frame) for a dense set of points, directly outputting metric‑scaled positions.
- Pose head predicts the 6‑DoF ego pose for each frame, enabling the point cloud to be placed correctly in the vehicle’s trajectory.
- Training Objective – A combination of supervised depth/point loss (from LiDAR ground truth) and pose regression loss, plus self‑supervised photometric consistency across frames to further regularise geometry.
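To make the alternating attention and dual‑head decoding concrete, below is a minimal PyTorch sketch, not the authors' implementation: the (batch, frames, views, patches, dim) token layout, the 384‑dimensional embedding, the six attention heads, and the small MLP heads are illustrative assumptions, and the DINO backbone and the losses described above are omitted.
```python
# Minimal sketch of the alternating-attention layout and dual-head decoder described
# above. Shapes, widths, and module choices are illustrative assumptions only.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One round of intra-view -> cross-view -> cross-frame attention with residuals."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.intra_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, t, v, p, d = tokens.shape  # (batch, frames, views, patch tokens, channels)

        # 1) Intra-view local attention: tokens attend within a single camera image.
        x = tokens.reshape(b * t * v, p, d)
        h = self.norms[0](x)
        x = x + self.intra_view(h, h, h)[0]

        # 2) Cross-view spatial attention: all cameras of one time step attend jointly.
        x = x.reshape(b * t, v * p, d)
        h = self.norms[1](x)
        x = x + self.cross_view(h, h, h)[0]

        # 3) Cross-frame temporal attention: each spatial slot attends across time.
        x = x.reshape(b, t, v, p, d).permute(0, 2, 3, 1, 4).reshape(b * v * p, t, d)
        h = self.norms[2](x)
        x = x + self.cross_frame(h, h, h)[0]

        return x.reshape(b, v, p, t, d).permute(0, 3, 1, 2, 4)  # back to (b, t, v, p, d)


class DualHeadDecoder(nn.Module):
    """Point-map head (dense 3D coordinates) plus pose head (6-DoF per frame)."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.point_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))
        self.pose_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 6))

    def forward(self, tokens: torch.Tensor):
        points = self.point_head(tokens)                 # (b, t, v, p, 3), ego frame of frame 0
        poses = self.pose_head(tokens.mean(dim=(2, 3)))  # (b, t, 6), one pose per frame
        return points, poses


# Toy forward pass: 2 frames from 3 uncalibrated cameras, 196 patch tokens per image.
tokens = torch.randn(1, 2, 3, 196, 384)
points, poses = DualHeadDecoder()(AlternatingAttentionBlock()(tokens))
print(points.shape, poses.shape)  # torch.Size([1, 2, 3, 196, 3]) torch.Size([1, 2, 6])
```
A full model would stack several such blocks before the decoder and supervise the point and pose outputs with the LiDAR, pose, and photometric terms listed in the training objective.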
Results & Findings
| Dataset | Result (vs. prior art) | Highlights |
|---|---|---|
| nuScenes | 0.62 (↑ 12% over MonoDETR) | Superior depth accuracy, especially at far range (>50 m) |
| Waymo | 0.58 (↑ 9% over DepthFormer) | Robust to varying camera rigs (3‑camera vs. 6‑camera) |
| KITTI | 0.71 (↑ 8% over DPT) | Precise ego‑pose estimation (<0.05 m translation error) |
| OpenScene / DDAD | Consistent gains across night, rain, and urban‑highway splits | Demonstrates strong domain generalisation |
Key takeaways
- Calibration‑free operation adds less than 0.02 m of average depth error compared with methods that assume perfect intrinsics.
- Temporal attention reduces flickering depth artifacts by ~35% in dynamic traffic scenes.
- The model scales gracefully: adding more cameras improves accuracy without retraining.
Practical Implications
- Simplified sensor stacks – OEMs can rely on pure camera rigs without expensive LiDAR or precise calibration pipelines, cutting hardware cost and integration time.
- Plug‑and‑play perception module – Since DVGT does not need camera parameters, the same model can be deployed across vehicle platforms with different camera layouts (e.g., four wide‑angle plus two narrow‑angle lenses).
- Real‑time mapping for ADAS – The transformer runs at ~15 fps on a modern automotive GPU (NVIDIA Orin), providing up‑to‑date dense maps for downstream tasks like path planning, obstacle avoidance, and free‑space estimation (a hypothetical usage sketch follows this list).
- Cross‑domain robustness – Training on a heterogeneous dataset means the model can be rolled out to new cities or weather conditions with minimal fine‑tuning.
- Open‑source code accelerates integration into existing perception stacks (ROS, Apollo, Autoware) and enables rapid prototyping of vision‑only SLAM pipelines.
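As a hypothetical illustration of how such outputs could feed a mapping or free‑space module, the sketch below converts a predicted 6‑DoF ego pose (assumed here to be axis‑angle rotation plus translation) into an SE(3) transform and registers successive point maps along the trajectory. The pose convention, array shapes, and function names are assumptions for illustration and not DVGT's published interface.
```python
# Hypothetical consumer of DVGT-style outputs: accumulate per-frame point maps into a
# world-frame cloud using predicted 6-DoF ego poses. Pose convention and shapes are
# illustrative assumptions, not the paper's API.
import numpy as np


def pose_to_matrix(pose6: np.ndarray) -> np.ndarray:
    """Axis-angle rotation (3) + translation (3) -> 4x4 homogeneous transform."""
    rvec, t = pose6[:3], pose6[3:]
    theta = np.linalg.norm(rvec)
    if theta < 1e-8:
        R = np.eye(3)
    else:
        k = rvec / theta
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues' formula
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T


def accumulate_map(point_maps: list, poses: list) -> np.ndarray:
    """Transform each frame's (N, 3) point map by its ego pose and stack into one cloud."""
    world_points = []
    for pts, pose in zip(point_maps, poses):
        T = pose_to_matrix(pose)
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 4) homogeneous points
        world_points.append((homo @ T.T)[:, :3])
    return np.concatenate(world_points, axis=0)


# Toy usage: two frames with different numbers of valid points.
frame_a = np.random.rand(5000, 3)
frame_b = np.random.rand(6000, 3)
pose_a = np.zeros(6)                                 # identity pose for the first frame
pose_b = np.array([0.0, 0.0, 0.02, 1.5, 0.0, 0.0])   # slight yaw plus 1.5 m forward motion
cloud = accumulate_map([frame_a, frame_b], [pose_a, pose_b])
print(cloud.shape)  # (11000, 3)
```
In a real deployment the toy arrays would be replaced by per‑frame model outputs streamed at the ~15 fps rate quoted above.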
Limitations & Future Work
- Computational load – While feasible on high‑end automotive GPUs, the multi‑head attention pipeline is still heavier than lightweight monocular depth nets; pruning or distillation will be needed for low‑power ECUs.
- Sparse dynamic objects – Fast‑moving small objects (e.g., cyclists) sometimes receive blurred depth estimates due to temporal smoothing; incorporating explicit motion models could help.
- Reliance on large‑scale LiDAR supervision – The current training regime needs dense LiDAR ground truth; future work may explore self‑supervised or synthetic data to reduce this dependency.
- Extended sensor fusion – Adding radar or low‑resolution depth sensors could further improve robustness in adverse weather, a direction the authors plan to explore.
Authors
- Sicheng Zuo
- Zixun Xie
- Wenzhao Zheng
- Shaoqing Xu
- Fang Li
- Shengyin Jiang
- Long Chen
- Zhi‑Xin Yang
- Jiwen Lu
Paper Information
- arXiv ID: 2512.16919v1
- Categories: cs.CV, cs.AI, cs.RO
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16919v1