[Paper] DVGT: Driving Visual Geometry Transformer
Source: arXiv - 2512.16919v1
Overview
The Driving Visual Geometry Transformer (DVGT) tackles a core challenge for autonomous vehicles: turning raw camera streams into a dense, metric‑scale 3D point cloud of the surrounding environment. By leveraging a transformer architecture that jointly reasons over space, view, and time, DVGT can reconstruct global geometry from arbitrary multi‑camera rigs without needing calibrated intrinsics or extrinsics. Trained on a massive mix of public driving datasets, it sets a new performance bar for vision‑only 3D perception.
Key Contributions
- Vision‑only dense geometry estimator that works with any number of cameras and does not require explicit camera calibration.
- Hybrid attention scheme: intra‑view local attention → cross‑view spatial attention → cross‑frame temporal attention, enabling the model to fuse information across pixels, viewpoints, and time steps.
- Dual‑head decoder that simultaneously outputs (1) a global ego‑centric point cloud and (2) per‑frame ego poses, removing the need for downstream SLAM or GPS alignment.
- Large‑scale multi‑dataset training (nuScenes, Waymo, KITTI, OpenScene, DDAD) demonstrating strong generalisation across cities, weather, and sensor setups.
- Open‑source implementation (code & pretrained weights) to accelerate research and industry adoption.
Methodology
- Feature Extraction – Each input image is passed through a DINO‑pretrained Vision Transformer (ViT) backbone, producing high‑level visual tokens.
- Alternating Attention Blocks –
- Intra‑view local attention captures fine‑grained geometry within a single camera frame (e.g., edges, textures).
- Cross‑view spatial attention lets tokens from different cameras attend to each other, learning correspondences across overlapping fields of view.
- Cross‑frame temporal attention propagates information forward and backward in time, stabilising depth estimates and handling occlusions.
These blocks are stacked repeatedly, allowing the network to iteratively refine a unified 3D representation (see the sketch after this list).
- Multi‑head Decoding –
- Point‑map head regresses 3D coordinates (in the ego frame of the first frame) for a dense set of points, directly outputting metric‑scaled positions.
- Pose head predicts the 6‑DoF ego pose for each frame, enabling the point cloud to be placed correctly in the vehicle’s trajectory.
- Training Objective – A combination of supervised depth/point loss (from LiDAR ground truth) and pose regression loss, plus self‑supervised photometric consistency across frames to further regularise geometry.
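To make the alternating attention and dual‑head decoding concrete, below is a minimal PyTorch sketch, not the authors' implementation: the (batch, frames, views, patches, dim) token layout, the 384‑dimensional embedding, the six attention heads, and the small MLP heads are illustrative assumptions, and the DINO backbone and the losses described above are omitted.
```python
# Minimal sketch of the alternating-attention layout and dual-head decoder described
# above. Shapes, widths, and module choices are illustrative assumptions only.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One round of intra-view -> cross-view -> cross-frame attention with residuals."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.intra_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, t, v, p, d = tokens.shape  # (batch, frames, views, patch tokens, channels)

        # 1) Intra-view local attention: tokens attend within a single camera image.
        x = tokens.reshape(b * t * v, p, d)
        h = self.norms[0](x)
        x = x + self.intra_view(h, h, h)[0]

        # 2) Cross-view spatial attention: all cameras of one time step attend jointly.
        x = x.reshape(b * t, v * p, d)
        h = self.norms[1](x)
        x = x + self.cross_view(h, h, h)[0]

        # 3) Cross-frame temporal attention: each spatial slot attends across time.
        x = x.reshape(b, t, v, p, d).permute(0, 2, 3, 1, 4).reshape(b * v * p, t, d)
        h = self.norms[2](x)
        x = x + self.cross_frame(h, h, h)[0]

        return x.reshape(b, v, p, t, d).permute(0, 3, 1, 2, 4)  # back to (b, t, v, p, d)


class DualHeadDecoder(nn.Module):
    """Point-map head (dense 3D coordinates) plus pose head (6-DoF per frame)."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.point_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))
        self.pose_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 6))

    def forward(self, tokens: torch.Tensor):
        points = self.point_head(tokens)                 # (b, t, v, p, 3), ego frame of frame 0
        poses = self.pose_head(tokens.mean(dim=(2, 3)))  # (b, t, 6), one pose per frame
        return points, poses


# Toy forward pass: 2 frames from 3 uncalibrated cameras, 196 patch tokens per image.
tokens = torch.randn(1, 2, 3, 196, 384)
points, poses = DualHeadDecoder()(AlternatingAttentionBlock()(tokens))
print(points.shape, poses.shape)  # torch.Size([1, 2, 3, 196, 3]) torch.Size([1, 2, 6])
```
A full model would stack several such blocks before the decoder and supervise the point and pose outputs with the LiDAR, pose, and photometric terms listed in the training objective.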
Results & Findings
| Dataset | Result (vs. prior art) | Highlights |
|---|---|---|
| nuScenes | 0.62 (↑ 12% over MonoDETR) | Superior depth accuracy, especially at far range (>50 m) |
| Waymo | 0.58 (↑ 9% over DepthFormer) | Robust to varying camera rigs (3‑camera vs. 6‑camera) |
| KITTI | 0.71 (↑ 8% over DPT) | Precise ego‑pose estimation (<0.05 m translation error) |
| OpenScene / DDAD | Consistent gains across night, rain, and urban‑highway splits | Demonstrates strong domain generalisation |
Key takeaways
- Calibration‑free operation adds less than 0.02 m of average depth error compared with methods that assume perfect intrinsics.
- Temporal attention reduces flickering depth artifacts by ~35% in dynamic traffic scenes.
- The model scales gracefully: adding more cameras improves accuracy without retraining.
Practical Implications
- Simplified sensor stacks – OEMs can rely on pure camera rigs without expensive LiDAR or precise calibration pipelines, cutting hardware cost and integration time.
- Plug‑and‑play perception module – Since DVGT does not need camera parameters, the same model can be deployed across vehicle platforms with different camera layouts (e.g., four wide‑angle plus two narrow‑angle lenses).
- Real‑time mapping for ADAS – The transformer runs at ~15 fps on a modern automotive GPU (NVIDIA Orin), providing up‑to‑date dense maps for downstream tasks like path planning, obstacle avoidance, and free‑space estimation (a hypothetical usage sketch follows this list).
- Cross‑domain robustness – Training on a heterogeneous dataset means the model can be rolled out to new cities or weather conditions with minimal fine‑tuning.
- Open‑source code accelerates integration into existing perception stacks (ROS, Apollo, Autoware) and enables rapid prototyping of vision‑only SLAM pipelines.
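As a hypothetical illustration of how such outputs could feed a mapping or free‑space module, the sketch below converts a predicted 6‑DoF ego pose (assumed here to be axis‑angle rotation plus translation) into an SE(3) transform and registers successive point maps along the trajectory. The pose convention, array shapes, and function names are assumptions for illustration and not DVGT's published interface.
```python
# Hypothetical consumer of DVGT-style outputs: accumulate per-frame point maps into a
# world-frame cloud using predicted 6-DoF ego poses. Pose convention and shapes are
# illustrative assumptions, not the paper's API.
import numpy as np


def pose_to_matrix(pose6: np.ndarray) -> np.ndarray:
    """Axis-angle rotation (3) + translation (3) -> 4x4 homogeneous transform."""
    rvec, t = pose6[:3], pose6[3:]
    theta = np.linalg.norm(rvec)
    if theta < 1e-8:
        R = np.eye(3)
    else:
        k = rvec / theta
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues' formula
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T


def accumulate_map(point_maps: list, poses: list) -> np.ndarray:
    """Transform each frame's (N, 3) point map by its ego pose and stack into one cloud."""
    world_points = []
    for pts, pose in zip(point_maps, poses):
        T = pose_to_matrix(pose)
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 4) homogeneous points
        world_points.append((homo @ T.T)[:, :3])
    return np.concatenate(world_points, axis=0)


# Toy usage: two frames with different numbers of valid points.
frame_a = np.random.rand(5000, 3)
frame_b = np.random.rand(6000, 3)
pose_a = np.zeros(6)                                 # identity pose for the first frame
pose_b = np.array([0.0, 0.0, 0.02, 1.5, 0.0, 0.0])   # slight yaw plus 1.5 m forward motion
cloud = accumulate_map([frame_a, frame_b], [pose_a, pose_b])
print(cloud.shape)  # (11000, 3)
```
In a real deployment the toy arrays would be replaced by per‑frame model outputs streamed at the ~15 fps rate quoted above.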
Limitations & Future Work
- Computational load – While feasible on high‑end automotive GPUs, the multi‑head attention pipeline is still heavier than lightweight monocular depth nets; pruning or distillation will be needed for low‑power ECUs.
- Sparse dynamic objects – Fast‑moving small objects (e.g., cyclists) sometimes receive blurred depth estimates due to temporal smoothing; incorporating explicit motion models could help.
- Reliance on large‑scale LiDAR supervision – The current training regime needs dense LiDAR ground truth; future work may explore self‑supervised or synthetic data to reduce this dependency.
- Extended sensor fusion – Adding radar or low‑resolution depth sensors could further improve robustness in adverse weather, a direction the authors plan to explore.
Authors
- Sicheng Zuo
- Zixun Xie
- Wenzhao Zheng
- Shaoqing Xu
- Fang Li
- Shengyin Jiang
- Long Chen
- Zhi‑Xin Yang
- Jiwen Lu
Paper Information
- arXiv ID: 2512.16919v1
- Categories: cs.CV, cs.AI, cs.RO
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16919v1