[Paper] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Published: February 5, 2026 at 01:59 PM EST
4 min read
Source: arXiv


Overview

The paper introduces GeoThinker, a new framework that lets multimodal large language models (MLLMs) actively pull in 3‑D geometric information when they need it, rather than passively mixing all visual and geometric features together. By making geometry a queryable resource, GeoThinker dramatically improves spatial reasoning on benchmarks and shows promise for real‑world tasks like embodied AI and autonomous driving.

Key Contributions

  • Active geometry retrieval: Instead of feeding the whole 3‑D representation to the language model, GeoThinker lets the model request geometry on‑demand, guided by its own internal reasoning state.
  • Spatial‑Grounded Fusion: A cross‑attention mechanism inserted at selected vision‑language model (VLM) layers that tightly couples semantic visual cues with the most relevant geometric evidence.
  • Importance Gating: A lightweight gating module that biases attention toward frames and structures that matter for the current task, reducing noise from irrelevant geometry.
  • State‑of‑the‑art performance: Achieves 72.6% accuracy on the VSI‑Bench spatial‑reasoning benchmark, surpassing the previous best method by 6.5 points.
  • Broad applicability: Demonstrates strong generalization on downstream scenarios such as embodied referring (e.g., “pick up the red cup on the table”) and autonomous driving perception.
  • Open‑source release: Code and pretrained models are publicly available, encouraging reproducibility and further research.
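The Importance Gating idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the gate is shown as a plain dot-product scorer, whereas the authors describe a small learned gating network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def importance_gate(query, frame_feats):
    """Score each frame's geometry against the current query and
    return normalized weights that bias attention toward relevant
    frames ("turning up" useful geometry, "turning down" the rest)."""
    scores = frame_feats @ query   # one relevance score per frame
    return softmax(scores)         # weights sum to 1

rng = np.random.default_rng(0)
query = rng.normal(size=8)          # current reasoning state
frames = rng.normal(size=(5, 8))    # per-frame geometric embeddings
weights = importance_gate(query, frames)
```

In the full model these weights would modulate the cross‑attention logits rather than being used directly, but the shape of the computation is the same.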

Methodology

  1. Base Architecture – GeoThinker builds on a standard vision‑language transformer (e.g., CLIP‑based VLM) that already processes 2‑D images and text.
  2. 3‑D Encoder – A separate 3‑D backbone (e.g., PointNet++ or a voxel‑based network) extracts per‑frame geometric embeddings from depth maps or LiDAR sweeps.
  3. Active Retrieval via Cross‑Attention
    • At a few strategically chosen transformer layers, the model’s semantic tokens issue queries to the geometric memory.
    • A frame‑strict cross‑attention ensures that each visual token only attends to geometry from the same temporal frame, preserving spatial consistency.
  4. Importance Gating
    • A small gating network predicts a relevance score for each frame/structure based on the current query.
    • The scores modulate the attention weights, effectively “turning up” geometry that matters and “turning down” the rest.
  5. Training – The whole system is end‑to‑end fine‑tuned on spatial reasoning datasets (e.g., VSI‑Bench) using a combination of language modeling loss and geometry‑aware supervision (e.g., 3‑D grounding loss).
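The frame‑strict cross‑attention in step 3 can be sketched with a boolean mask that blocks attention across frames. This is a simplified single‑head illustration under assumed tensor shapes, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_strict_cross_attention(vis, geo, vis_frame, geo_frame):
    """Semantic visual tokens (queries) attend to the geometric
    memory, but only to geometry from the same temporal frame."""
    d = vis.shape[-1]
    logits = vis @ geo.T / np.sqrt(d)                      # (Nv, Ng)
    same_frame = vis_frame[:, None] == geo_frame[None, :]  # frame-strict mask
    logits = np.where(same_frame, logits, -1e9)            # block cross-frame links
    attn = softmax(logits, axis=-1)
    return attn @ geo                                      # geometry-enriched tokens

# Toy check: with one geometry token per frame, each visual token
# must receive exactly its own frame's geometry.
vis = np.ones((2, 4))
geo = np.array([[1., 0., 0., 0.],
                [0., 1., 0., 0.]])
out = frame_strict_cross_attention(vis, geo,
                                   np.array([0, 1]), np.array([0, 1]))
```

The `-1e9` fill is the standard trick for masking attention before the softmax; Importance Gating would additionally rescale these logits by per‑frame relevance scores.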

The key idea is that geometry becomes a dynamic knowledge source, queried only when the language model’s reasoning path signals that spatial information is needed.
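This "query only when needed" behavior can be sketched as a gated lookup. The `probe` vector here is a hypothetical stand‑in for whatever learned signal the model uses to decide that spatial information is required:

```python
import numpy as np

def maybe_query_geometry(token_state, geo_memory, probe, threshold=0.5):
    """Fetch from the 3-D memory only when a relevance signal
    (a fixed probe vector in this sketch) says geometry is needed."""
    need = 1.0 / (1.0 + np.exp(-token_state @ probe))  # "do I need geometry?"
    if need < threshold:
        return token_state, False       # skip the 3-D lookup entirely
    scores = geo_memory @ token_state   # attend over geometric memory
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return token_state + w @ geo_memory, True  # fuse retrieved geometry

geo = np.eye(3)
# a token aligned with the probe -> geometry is fetched
_, used = maybe_query_geometry(np.ones(3), geo, probe=np.ones(3))
# a token opposed to the probe -> lookup is skipped
_, skipped = maybe_query_geometry(-np.ones(3), geo, probe=np.ones(3))
```

Skipping the lookup for tokens that do not need geometry is also what keeps the inference overhead low relative to full‑fusion baselines.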

Results & Findings

| Dataset / Task | Metric (higher is better) | GeoThinker | Prior SOTA |
| --- | --- | --- | --- |
| VSI‑Bench (spatial QA) | Accuracy | 72.6% | 66.1% |
| Embodied Referring (AI2‑Thor) | Success Rate | 84.3% | 77.5% |
| Autonomous Driving (nuScenes) | mAP (3‑D object detection) | 48.7% | 44.2% |

  • Semantic‑Geometry Alignment: Ablation studies show that removing active retrieval drops performance by ~5–7 points, confirming that selective geometry integration matters.
  • Efficiency: Because only a subset of frames is attended to, inference overhead grows by ~15% compared to a vanilla VLM, far less than the 40%+ cost of full‑fusion baselines.
  • Robustness: GeoThinker maintains high accuracy even when parts of the 3‑D input are noisy or missing, indicating that the gating mechanism successfully filters out bad signals.

Practical Implications

  • Robotics & Embodied AI – Developers can plug GeoThinker into existing instruction‑following agents to give them a reliable sense of “where” objects are, improving pick‑and‑place or navigation tasks without redesigning the whole perception stack.
  • Autonomous Vehicles – The active geometry query can be used to focus computational resources on the most relevant road participants (e.g., pedestrians crossing the street), potentially lowering latency in safety‑critical pipelines.
  • AR/VR Content Creation – Spatially aware chatbots or assistants can answer user queries about the 3‑D layout of a scene (e.g., “What’s behind the sofa?”) with higher fidelity, enhancing immersive experiences.
  • Developer Workflow – Since GeoThinker is released as a modular library, teams can integrate it with popular LLM APIs (OpenAI, Anthropic) and 3‑D perception frameworks (Open3D, ROS) with minimal code changes.

Limitations & Future Work

  • Dependency on Accurate 3‑D Input – While the gating mitigates some noise, the system still assumes reasonably clean depth or LiDAR data; extreme sensor failures degrade performance.
  • Scalability to Very Long Sequences – The current design queries geometry at a fixed number of VLM layers; handling ultra‑long video streams may require hierarchical or memory‑efficient extensions.
  • Domain Transfer – GeoThinker was primarily evaluated on indoor and driving datasets; adapting to aerial or underwater domains may need domain‑specific geometric encoders.
  • Future Directions – The authors suggest exploring learnable query strategies (e.g., reinforcement‑learning‑driven geometry requests) and extending the framework to multimodal reasoning that includes audio or tactile cues.

Authors

  • Haoyuan Li
  • Qihang Cao
  • Tao Tang
  • Kun Xiang
  • Zihan Guo
  • Jianhua Han
  • Hang Xu
  • Xiaodan Liang

Paper Information

  • arXiv ID: 2602.06037v1
  • Categories: cs.CV
  • Published: February 5, 2026