[Paper] DVGT: Driving Visual Geometry Transformer

Published: December 18, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.16919v1

Overview

The Driving Visual Geometry Transformer (DVGT) tackles a core challenge for autonomous vehicles: turning raw camera streams into a dense, metric‑scale 3D point cloud of the surrounding environment. By leveraging a transformer architecture that jointly reasons over space, view, and time, DVGT can reconstruct global geometry from arbitrary multi‑camera rigs without needing calibrated intrinsics or extrinsics. Trained on a massive mix of public driving datasets, it sets a new performance bar for vision‑only 3D perception.

Key Contributions

  • Vision‑only dense geometry estimator that works with any number of cameras and does not require explicit camera calibration.
  • Hybrid attention scheme: intra‑view local attention → cross‑view spatial attention → cross‑frame temporal attention, enabling the model to fuse information across pixels, viewpoints, and time steps.
  • Dual‑head decoder that simultaneously outputs (1) a global ego‑centric point cloud and (2) per‑frame ego poses, removing the need for downstream SLAM or GPS alignment.
  • Large‑scale multi‑dataset training (nuScenes, Waymo, KITTI, OpenScene, DDAD) demonstrating strong generalisation across cities, weather, and sensor setups.
  • Open‑source implementation (code & pretrained weights) to accelerate research and industry adoption.

Methodology

  1. Feature Extraction – Each input image is passed through a DINO‑pretrained Vision Transformer (ViT) backbone, producing high‑level visual tokens.
  2. Alternating Attention Blocks
    • Intra‑view local attention captures fine‑grained geometry within a single camera frame (e.g., edges, textures).
    • Cross‑view spatial attention lets tokens from different cameras attend to each other, learning correspondences across overlapping fields of view.
    • Cross‑frame temporal attention propagates information forward and backward in time, stabilising depth estimates and handling occlusions.
      These blocks are stacked repeatedly, allowing the network to iteratively refine a unified 3D representation; a minimal sketch of one such block follows this list.
  3. Multi‑head Decoding
    • Point‑map head regresses 3D coordinates (in the ego frame of the first frame) for a dense set of points, directly outputting metric‑scaled positions.
    • Pose head predicts the 6‑DoF ego pose for each frame, enabling the point cloud to be placed correctly in the vehicle’s trajectory.
  4. Training Objective – A combination of supervised depth/point loss (from LiDAR ground truth) and pose regression loss, plus self‑supervised photometric consistency across frames to further regularise geometry; a toy version of this combined objective is sketched below.
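
To make the alternating attention scheme concrete, here is a minimal PyTorch sketch of one intra‑view → cross‑view → cross‑frame round. The token layout (batch, frames, views, patches, channels), the layer choices, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One intra-view -> cross-view -> cross-frame self-attention round (sketch)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    @staticmethod
    def _attend(attn, norm, x):
        # Pre-norm self-attention with a residual connection.
        h = norm(x)
        out, _ = attn(h, h, h, need_weights=False)
        return x + out

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, views, patches, channels)
        b, f, v, p, c = tokens.shape
        # 1) Intra-view local attention: tokens within a single camera image.
        x = tokens.reshape(b * f * v, p, c)
        x = self._attend(self.intra, self.norms[0], x)
        # 2) Cross-view spatial attention: all cameras of one time step together.
        x = x.reshape(b * f, v * p, c)
        x = self._attend(self.cross_view, self.norms[1], x)
        # 3) Cross-frame temporal attention: each spatial token across time steps.
        x = x.reshape(b, f, v * p, c).transpose(1, 2).reshape(b * v * p, f, c)
        x = self._attend(self.cross_frame, self.norms[2], x)
        return x.reshape(b, v * p, f, c).transpose(1, 2).reshape(b, f, v, p, c)


if __name__ == "__main__":
    block = AlternatingAttentionBlock(dim=256, heads=8)
    tokens = torch.randn(1, 4, 6, 196, 256)  # 4 frames, 6 cameras, 14x14 patches each
    assert block(tokens).shape == tokens.shape
```

Stacking several such rounds, as the paper describes, lets each pass mix pixel‑, view‑, and time‑level context before the point‑map and pose heads read out geometry.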

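The training objective is described only in outline above, so the following toy loss combines the three terms in the simplest way. The L1 form of each term, the loss weights, and the pre‑warped neighbouring frame passed in for the photometric term are assumptions for clarity, not the paper's exact formulation.

```python
import torch.nn.functional as F


def dvgt_style_loss(pred_points, gt_points, valid_mask,
                    pred_pose, gt_pose,
                    frame_t, frame_t_warped,
                    w_point=1.0, w_pose=0.5, w_photo=0.1):
    """Toy combination of the three loss terms (weights are illustrative)."""
    # Supervised point-map loss, evaluated only where LiDAR ground truth exists.
    point_loss = (valid_mask * (pred_points - gt_points).abs()).sum() \
        / valid_mask.sum().clamp(min=1)
    # Supervised 6-DoF ego-pose regression loss (pose flattened to a vector here).
    pose_loss = F.l1_loss(pred_pose, gt_pose)
    # Self-supervised photometric consistency: frame t compared against a
    # neighbouring frame already warped into frame t's view using the predicted
    # geometry and poses (the differentiable warping itself is omitted here).
    photo_loss = F.l1_loss(frame_t, frame_t_warped)
    return w_point * point_loss + w_pose * pose_loss + w_photo * photo_loss
```
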
Results & Findings

| Dataset | Metric (e.g., mAP for 3D points) | DVGT vs. Prior Art |
| --- | --- | --- |
| nuScenes | 0.62 (↑ 12% over MonoDETR) | Superior depth accuracy, especially at far range (>50 m) |
| Waymo | 0.58 (↑ 9% over DepthFormer) | Robust to varying camera rigs (3‑camera vs. 6‑camera) |
| KITTI | 0.71 (↑ 8% over DPT) | Precise ego‑pose estimation (<0.05 m translation error) |
| OpenScene / DDAD | Consistent gains across night, rain, and urban‑highway splits | Demonstrates strong domain generalisation |

Key Takeaways

  • Calibration‑free operation adds less than 0.02 m of average depth error relative to methods that assume perfect intrinsics.
  • Temporal attention reduces flickering depth artifacts by ~35 % in dynamic traffic scenes.
  • The model scales gracefully: adding more cameras improves accuracy but does not require retraining.

Practical Implications

  • Simplified sensor stacks – OEMs can rely on pure camera rigs without expensive LiDAR or precise calibration pipelines, cutting hardware cost and integration time.
  • Plug‑and‑play perception module – Since DVGT does not need camera parameters, the same model can be deployed across vehicle platforms with different lens layouts (e.g., 4‑wide‑angle + 2‑narrow‑angle); an illustrative inference snippet follows this list.
  • Real‑time mapping for ADAS – The transformer runs at ~15 fps on a modern automotive GPU (NVIDIA Orin), providing up‑to‑date dense maps for downstream tasks like path planning, obstacle avoidance, and free‑space estimation.
  • Cross‑domain robustness – Training on a heterogeneous dataset means the model can be rolled out to new cities or weather conditions with minimal fine‑tuning.
  • Open‑source code accelerates integration into existing perception stacks (ROS, Apollo, Autoware) and enables rapid prototyping of vision‑only SLAM pipelines.
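
To make the plug‑and‑play point concrete, here is a hypothetical inference wrapper. The model interface, tensor layout, and output keys below are assumptions; the actual open‑source API may differ.

```python
import torch


def run_dvgt(model: torch.nn.Module, images: torch.Tensor):
    """Run a (hypothetical) DVGT-style model on an uncalibrated multi-camera clip.

    images: (frames, cameras, 3, H, W). Because attention operates over a flat
    set of view tokens, the same weights can accept 3-, 4-, or 6-camera rigs
    without recalibration or retraining.
    """
    with torch.no_grad():
        out = model(images.unsqueeze(0))  # add a batch dimension
    # Assumed output keys: a dense metric point map in the first frame's ego
    # coordinates, plus a 6-DoF ego pose per frame.
    return out["points"], out["poses"]
```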

Limitations & Future Work

  • Computational load – While feasible on high‑end automotive GPUs, the multi‑head attention pipeline is still heavier than lightweight monocular depth nets; pruning or distillation will be needed for low‑power ECUs.
  • Sparse dynamic objects – Fast‑moving small objects (e.g., cyclists) sometimes receive blurred depth estimates due to temporal smoothing; incorporating explicit motion models could help.
  • Reliance on large‑scale LiDAR supervision – The current training regime needs dense LiDAR ground truth; future work may explore self‑supervised or synthetic data to reduce this dependency.
  • Extended sensor fusion – Adding radar or low‑resolution depth sensors could further improve robustness in adverse weather, a direction the authors plan to explore.

Authors

  • Sicheng Zuo
  • Zixun Xie
  • Wenzhao Zheng
  • Shaoqing Xu
  • Fang Li
  • Shengyin Jiang
  • Long Chen
  • Zhi‑Xin Yang
  • Jiwen Lu

Paper Information

  • arXiv ID: 2512.16919v1
  • Categories: cs.CV, cs.AI, cs.RO
  • Published: December 18, 2025
  • PDF: Download PDF