[Paper] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Published: February 5, 2026 at 01:59 PM EST
4 min read
Source: arXiv


Overview

The paper introduces GeoThinker, a new framework that lets multimodal large language models (MLLMs) actively pull in 3‑D geometric information when they need it, rather than passively mixing all visual and geometric features together. By making geometry a queryable resource, GeoThinker dramatically improves spatial reasoning on benchmarks and shows promise for real‑world tasks like embodied AI and autonomous driving.

Key Contributions

  • Active geometry retrieval: Instead of feeding the whole 3‑D representation to the language model, GeoThinker lets the model request geometry on‑demand, guided by its own internal reasoning state.
  • Spatial‑Grounded Fusion: A cross‑attention mechanism inserted at selected vision‑language model (VLM) layers that tightly couples semantic visual cues with the most relevant geometric evidence.
  • Importance Gating: A lightweight gating module that biases attention toward frames and structures that matter for the current task, reducing noise from irrelevant geometry.
  • State‑of‑the‑art performance: Achieves 72.6% accuracy on the VSI‑Bench spatial‑reasoning benchmark, surpassing the previous best method by 6.5 points.
  • Broad applicability: Demonstrates strong generalization on downstream scenarios such as embodied referring (e.g., “pick up the red cup on the table”) and autonomous driving perception.
  • Open‑source release: Code and pretrained models are publicly available, encouraging reproducibility and further research.
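The Importance Gating idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the gate is shown as a plain dot-product scorer, whereas the authors describe a small learned gating network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def importance_gate(query, frame_feats):
    """Score each frame's geometry against the current query and
    return normalized weights that bias attention toward relevant
    frames ("turning up" useful geometry, "turning down" the rest)."""
    scores = frame_feats @ query   # one relevance score per frame
    return softmax(scores)         # weights sum to 1

rng = np.random.default_rng(0)
query = rng.normal(size=8)          # current reasoning state
frames = rng.normal(size=(5, 8))    # per-frame geometric embeddings
weights = importance_gate(query, frames)
```

In the full model these weights would modulate the cross‑attention logits rather than being used directly, but the shape of the computation is the same.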

Methodology

  1. Base Architecture – GeoThinker builds on a standard vision‑language transformer (e.g., CLIP‑based VLM) that already processes 2‑D images and text.
  2. 3‑D Encoder – A separate 3‑D backbone (e.g., PointNet++ or a voxel‑based network) extracts per‑frame geometric embeddings from depth maps or LiDAR sweeps.
  3. Active Retrieval via Cross‑Attention
    • At a few strategically chosen transformer layers, the model’s semantic tokens issue queries to the geometric memory.
    • A frame‑strict cross‑attention ensures that each visual token only attends to geometry from the same temporal frame, preserving spatial consistency.
  4. Importance Gating
    • A small gating network predicts a relevance score for each frame/structure based on the current query.
    • The scores modulate the attention weights, effectively “turning up” geometry that matters and “turning down” the rest.
  5. Training – The whole system is end‑to‑end fine‑tuned on spatial reasoning datasets (e.g., VSI‑Bench) using a combination of language modeling loss and geometry‑aware supervision (e.g., 3‑D grounding loss).
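The frame‑strict cross‑attention in step 3 can be sketched with a boolean mask that blocks attention across frames. This is a simplified single‑head illustration under assumed tensor shapes, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_strict_cross_attention(vis, geo, vis_frame, geo_frame):
    """Semantic visual tokens (queries) attend to the geometric
    memory, but only to geometry from the same temporal frame."""
    d = vis.shape[-1]
    logits = vis @ geo.T / np.sqrt(d)                      # (Nv, Ng)
    same_frame = vis_frame[:, None] == geo_frame[None, :]  # frame-strict mask
    logits = np.where(same_frame, logits, -1e9)            # block cross-frame links
    attn = softmax(logits, axis=-1)
    return attn @ geo                                      # geometry-enriched tokens

# Toy check: with one geometry token per frame, each visual token
# must receive exactly its own frame's geometry.
vis = np.ones((2, 4))
geo = np.array([[1., 0., 0., 0.],
                [0., 1., 0., 0.]])
out = frame_strict_cross_attention(vis, geo,
                                   np.array([0, 1]), np.array([0, 1]))
```

The `-1e9` fill is the standard trick for masking attention before the softmax; Importance Gating would additionally rescale these logits by per‑frame relevance scores.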

The key idea is that geometry becomes a dynamic knowledge source, queried only when the language model’s reasoning path signals that spatial information is needed.
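This "query only when needed" behavior can be sketched as a gated lookup. The `probe` vector here is a hypothetical stand‑in for whatever learned signal the model uses to decide that spatial information is required:

```python
import numpy as np

def maybe_query_geometry(token_state, geo_memory, probe, threshold=0.5):
    """Fetch from the 3-D memory only when a relevance signal
    (a fixed probe vector in this sketch) says geometry is needed."""
    need = 1.0 / (1.0 + np.exp(-token_state @ probe))  # "do I need geometry?"
    if need < threshold:
        return token_state, False       # skip the 3-D lookup entirely
    scores = geo_memory @ token_state   # attend over geometric memory
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return token_state + w @ geo_memory, True  # fuse retrieved geometry

geo = np.eye(3)
# a token aligned with the probe -> geometry is fetched
_, used = maybe_query_geometry(np.ones(3), geo, probe=np.ones(3))
# a token opposed to the probe -> lookup is skipped
_, skipped = maybe_query_geometry(-np.ones(3), geo, probe=np.ones(3))
```

Skipping the lookup for tokens that do not need geometry is also what keeps the inference overhead low relative to full‑fusion baselines.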

Results & Findings

| Dataset / Task | Metric (higher is better) | GeoThinker | Prior SOTA |
| --- | --- | --- | --- |
| VSI‑Bench (spatial QA) | Accuracy | 72.6% | 66.1% |
| Embodied Referring (AI2‑Thor) | Success Rate | 84.3% | 77.5% |
| Autonomous Driving (nuScenes) | mAP (3‑D object detection) | 48.7% | 44.2% |

  • Semantic‑Geometry Alignment: Ablation studies show that removing active retrieval drops performance by ~5–7 points, confirming that selective geometry integration matters.
  • Efficiency: Because only a subset of frames is attended to, inference overhead grows by ~15% compared to a vanilla VLM, far less than the 40%+ cost of full‑fusion baselines.
  • Robustness: GeoThinker maintains high accuracy even when parts of the 3‑D input are noisy or missing, indicating that the gating mechanism successfully filters out bad signals.

Practical Implications

  • Robotics & Embodied AI – Developers can plug GeoThinker into existing instruction‑following agents to give them a reliable sense of “where” objects are, improving pick‑and‑place or navigation tasks without redesigning the whole perception stack.
  • Autonomous Vehicles – The active geometry query can be used to focus computational resources on the most relevant road participants (e.g., pedestrians crossing the street), potentially lowering latency in safety‑critical pipelines.
  • AR/VR Content Creation – Spatially aware chatbots or assistants can answer user queries about the 3‑D layout of a scene (e.g., “What’s behind the sofa?”) with higher fidelity, enhancing immersive experiences.
  • Developer Workflow – Since GeoThinker is released as a modular library, teams can integrate it with popular LLM APIs (OpenAI, Anthropic) and 3‑D perception frameworks (Open3D, ROS) with minimal code changes.

Limitations & Future Work

  • Dependency on Accurate 3‑D Input – While the gating mitigates some noise, the system still assumes reasonably clean depth or LiDAR data; extreme sensor failures degrade performance.
  • Scalability to Very Long Sequences – The current design queries geometry at a fixed number of VLM layers; handling ultra‑long video streams may require hierarchical or memory‑efficient extensions.
  • Domain Transfer – GeoThinker was primarily evaluated on indoor and driving datasets; adapting to aerial or underwater domains may need domain‑specific geometric encoders.
  • Future Directions – The authors suggest exploring learnable query strategies (e.g., reinforcement‑learning‑driven geometry requests) and extending the framework to multimodal reasoning that includes audio or tactile cues.

Authors

  • Haoyuan Li
  • Qihang Cao
  • Tao Tang
  • Kun Xiang
  • Zihan Guo
  • Jianhua Han
  • Hang Xu
  • Xiaodan Liang

Paper Information

  • arXiv ID: 2602.06037v1
  • Categories: cs.CV
  • Published: February 5, 2026