[Paper] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Source: arXiv - 2511.21688v1
Overview
Vision-Language Models (VLMs) still lack robust spatial intelligence, performing poorly on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry-grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding.
- G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes.
- It enhances spatial reasoning tasks via in‑context learning and interleaved reasoning.
- The unified design is highly scalable: it trains on abundant multi-view image and video data, yet still gains the 3D visual priors that would otherwise require hard-to-collect 3D annotations.
Experimental results show that G$^2$VLM is proficient in both tasks, achieving comparable results to state‑of‑the‑art feed‑forward 3D reconstruction models and delivering better or competitive performance across spatial understanding and reasoning benchmarks. By unifying a semantically strong VLM with low‑level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock future applications such as 3D scene editing.
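To make the unified design above concrete, the sketch below shows one way such a model could be wired: a shared visual geometry encoder whose features feed both a low-level 3D prediction head (e.g., per-token 3D points) and a language head for spatial question answering. This is a minimal, hypothetical illustration based only on the abstract; all module names, shapes, and heads are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a geometry-grounded VLM layout.
# NOT the G^2VLM implementation: module names, dimensions, and the
# point-prediction head are illustrative assumptions from the abstract.
import torch
import torch.nn as nn


class GeometryGroundedVLMSketch(nn.Module):
    def __init__(self, dim=512, vocab_size=32000):
        super().__init__()
        # Shared visual geometry encoder (stand-in for a multi-view ViT).
        vis_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(vis_layer, num_layers=4)
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)  # toy patchifier

        # Head 1: low-level 3D reconstruction (per-token 3D point prediction).
        self.point_head = nn.Linear(dim, 3)

        # Head 2: language modeling over fused visual + text tokens for
        # spatial understanding and interleaved reasoning.
        self.text_embed = nn.Embedding(vocab_size, dim)
        fusion_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patches, text_ids):
        # patches: (B, N, 3*16*16) flattened patches from one or more views
        # text_ids: (B, T) token ids of the spatial question / prompt
        vis = self.visual_encoder(self.patch_embed(patches))      # geometry features
        points = self.point_head(vis)                             # (B, N, 3) point map
        joint = self.fusion(torch.cat([vis, self.text_embed(text_ids)], dim=1))
        logits = self.lm_head(joint[:, vis.shape[1]:])            # (B, T, vocab)
        return points, logits


if __name__ == "__main__":
    model = GeometryGroundedVLMSketch()
    patches = torch.randn(1, 196, 3 * 16 * 16)
    text_ids = torch.randint(0, 32000, (1, 12))
    points, logits = model(patches, text_ids)
    print(points.shape, logits.shape)  # (1, 196, 3) and (1, 12, 32000)
```

The point of the sketch is the sharing: the same geometry features supervise a 3D objective on multi-view data and condition the language head, which is what lets the paper's unified design scale without per-sample 3D annotations.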
Authors
- Wenbo Hu
- Jingli Lin
- Yilin Long
- Yunlong Ran
- Lihan Jiang
- Yifan Wang
- Chenming Zhu
- Runsen Xu
- Tai Wang
- Jiangmiao Pang
Categories
- cs.CV
- cs.AI
- cs.CL
Paper Information
- arXiv ID: 2511.21688v1
- Published: November 27, 2025