[Paper] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Published: November 26, 2025 at 01:59 PM EST

Source: arXiv - 2511.21688v1

Overview

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry‑grounded vision‑language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding.

  • G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes.
  • It enhances spatial reasoning tasks via in‑context learning and interleaved reasoning.
  • The unified design is highly scalable: it trains on abundant multi‑view image and video data while benefiting from learned 3D visual priors that would otherwise require hard‑to‑collect 3D annotations.

Experimental results show that G$^2$VLM is proficient in both tasks, achieving results comparable to state‑of‑the‑art feed‑forward 3D reconstruction models and better or competitive performance across spatial understanding and reasoning benchmarks. By unifying a semantically strong VLM with low‑level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock future applications such as 3D scene editing.
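
To make the unified design concrete, the sketch below shows one plausible shape such a model could take: a shared visual backbone whose tokens feed both a geometry head (predicting dense 3D attributes such as a per‑patch point map) and a language decoder for spatial reasoning. This is a minimal conceptual illustration, not the authors' implementation; the class name `GeometryGroundedVLM`, the ViT‑style patch encoder, all layer sizes, and the toy vocabulary are assumptions for demonstration only.

```python
# Conceptual sketch (NOT the paper's code): a shared vision backbone feeds
# (a) a geometry head predicting per-patch 3D attributes, and
# (b) a language decoder that consumes the same visual tokens for spatial reasoning.
# All module names, sizes, and outputs below are illustrative assumptions.
import torch
import torch.nn as nn


class GeometryGroundedVLM(nn.Module):
    def __init__(self, d_model: int = 256, vocab_size: int = 1000):
        super().__init__()
        # Shared visual backbone: flattened 16x16 RGB patches -> tokens (ViT-like assumption).
        self.patch_embed = nn.Linear(3 * 16 * 16, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        # Geometry head: predicts a 3D point (x, y, z) per patch token (a toy "point map").
        self.geometry_head = nn.Linear(d_model, 3)
        # Language side: text tokens attend to the visual/geometry tokens and decode an answer.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches: torch.Tensor, text_ids: torch.Tensor):
        # patches:  (B, n_patches, 3*16*16) flattened image patches
        # text_ids: (B, seq_len) token ids of a spatial-reasoning prompt
        vis = self.encoder(self.patch_embed(patches))
        point_map = self.geometry_head(vis)              # (B, n_patches, 3) 3D attributes
        txt = self.decoder(self.text_embed(text_ids), vis)
        logits = self.lm_head(txt)                       # (B, seq_len, vocab) answer tokens
        return point_map, logits


if __name__ == "__main__":
    model = GeometryGroundedVLM()
    patches = torch.randn(2, 196, 3 * 16 * 16)           # 2 images, 14x14 patch grid
    text_ids = torch.randint(0, 1000, (2, 12))
    point_map, logits = model(patches, text_ids)
    print(point_map.shape, logits.shape)                 # (2, 196, 3) and (2, 12, 1000)
```

The point of the sketch is only the data flow: both the 3D reconstruction output and the text output are conditioned on the same learned visual geometry tokens, which is the coupling the paper's unified design describes.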

Authors

  • Wenbo Hu
  • Jingli Lin
  • Yilin Long
  • Yunlong Ran
  • Lihan Jiang
  • Yifan Wang
  • Chenming Zhu
  • Runsen Xu
  • Tai Wang
  • Jiangmiao Pang

Categories

  • cs.CV
  • cs.AI
  • cs.CL

Paper Information

  • arXiv ID: 2511.21688v1
  • Published: November 27, 2025
  • PDF: https://arxiv.org/pdf/2511.21688v1