[Paper] SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

Published: December 3, 2025 at 01:50 PM EST
4 min read
Source: arXiv - 2512.04069v1

Overview

The paper introduces SpaceTools, a new framework that lets large vision‑language models (VLMs) reason about precise geometry by learning how to call and combine multiple visual “tools” (e.g., depth estimators, segmentation nets, pose detectors). By training the model with Double Interactive Reinforcement Learning (DIRL), the authors achieve state‑of‑the‑art spatial reasoning on several benchmarks and demonstrate reliable manipulation on a real 7‑DOF robot.

Key Contributions

  • DIRL training pipeline – a two‑phase reinforcement‑learning scheme that first teaches a VLM from expert tool‑specialist demonstrations, then lets it explore and refine multi‑tool coordination on its own (a toy version of this loop is sketched after this list).
  • Tool‑augmented spatial reasoning – the model learns to select, invoke, and fuse outputs from several vision tools on the fly, rather than relying on a fixed pipeline or handcrafted prompts.
  • SpaceTools model – achieves the best reported scores on RoboSpatial‑Home (+12 % over supervised fine‑tuning, +16 % over vanilla RL), BLINK, and BOP‑ASK benchmarks.
  • Real‑world validation – the approach is deployed on a 7‑DOF robot arm, showing robust pick‑and‑place and pose‑adjustment tasks that require metric‑level accuracy.
  • Open‑source release – code, pretrained checkpoints, and an interactive demo are made publicly available.
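
To make the two‑phase recipe concrete, here is a minimal Python sketch of how a DIRL‑style pipeline could be organized. It is an illustration under our own assumptions, not the authors' implementation: the function names, the dict‑based "policy", and the toy reward (task success minus a penalty per tool call, as described under Methodology below) are all hypothetical.

```python
# Minimal sketch of a DIRL-style two-phase loop (all names, the toy reward,
# and the dict-based "policy" are assumptions, not the authors' code).
import random

def specialist_demo(task):
    """Stand-in for an expert tool-specialist trajectory on one sub-task."""
    return {"task": task, "trace": [("depth", task)], "answer": "demo"}

def frontier_rollout(task, policy):
    """Stand-in for the frontier VLM sampling its own tool-use trace."""
    n_calls = policy["avg_calls"]
    return {"task": task, "trace": [("depth", task)] * n_calls, "answer": "guess"}

def teaching_phase(tasks, policy):
    """Phase 1: mix specialist demos (what to do) with frontier traces
    (how the model currently behaves) into a training curriculum."""
    curriculum = []
    for t in tasks:
        curriculum += [specialist_demo(t), frontier_rollout(t, policy)]
    # The real system fine-tunes the VLM on this curriculum; here we just
    # pretend the model learned to call fewer tools.
    policy["avg_calls"] = 2
    return curriculum

def exploration_phase(tasks, policy, call_penalty=0.1, epochs=5):
    """Phase 2: RL on task success, with a penalty per tool call so the
    model learns efficient, purposeful tool usage."""
    for _ in range(epochs):
        for t in tasks:
            rollout = frontier_rollout(t, policy)
            success = random.random() < 0.8          # stand-in success signal
            reward = float(success) - call_penalty * len(rollout["trace"])
            # A real implementation would apply a policy-gradient update;
            # this toy update just nudges the model toward fewer calls.
            if reward > 0 and policy["avg_calls"] > 1:
                policy["avg_calls"] -= 1
    return policy

policy = {"avg_calls": 3}
teaching_phase(["find the nearest cup"], policy)
print(exploration_phase(["find the nearest cup"], policy))
```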

Methodology

  1. Tool Suite – The system bundles off‑the‑shelf visual modules (depth, semantic segmentation, object pose estimation). Each tool can be queried with a natural‑language instruction and returns a structured output (e.g., a depth map).
  2. Teaching Phase
    • Tool specialist: a single‑tool agent is trained via interactive RL to master a specific reasoning sub‑task (e.g., “find the nearest cup”).
    • Frontier model: a larger VLM that can call any tool but has no coordination skill yet.
    • Demonstrations from the specialist are mixed with traces from the frontier model to create a curriculum that shows what to do and how to call the right tool.
  3. Exploration Phase – The frontier model continues training with RL, receiving reward signals based on task success (e.g., correct spatial relation classification) and a penalty for unnecessary tool calls. This encourages efficient, purposeful tool usage.
  4. Policy Architecture – The VLM’s language encoder is fused with a lightweight controller that predicts a tool‑selection distribution and a textual query for the chosen tool. The tool’s output is fed back into the language model, closing the perception‑action loop (a toy version of this call‑and‑feedback loop follows this list).
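
Steps 1 and 4 together describe a call‑and‑feedback loop: the controller emits a (tool, textual query) pair, the selected tool returns a structured output, and that output re‑enters the model’s context. Below is a minimal Python sketch of that loop under our own assumptions; the names TOOLS, propose_call, and answer are hypothetical, and real tools would return depth maps or poses rather than placeholder dicts.

```python
# Hypothetical sketch of the tool-call loop from steps 1 and 4; the names
# (TOOLS, propose_call, answer) are illustrative, not the paper's interface.
from typing import Callable, Dict, Optional, Tuple

# Each tool takes a natural-language query plus an image and returns a
# structured output (a dict stands in for e.g. a depth map or object pose).
ToolFn = Callable[[str, bytes], Dict]

TOOLS: Dict[str, ToolFn] = {
    "depth": lambda query, img: {"kind": "depth_map", "query": query},
    "segment": lambda query, img: {"kind": "masks", "query": query},
    "pose": lambda query, img: {"kind": "object_pose", "query": query},
}

def propose_call(context: list) -> Tuple[Optional[str], str]:
    """Stub for the controller head that predicts a tool-selection
    distribution and a textual query; here it calls depth once, then stops."""
    if len(context) == 1:
        return "depth", "distance to each cup"
    return None, ""

def answer(question: str, image: bytes, max_calls: int = 4) -> str:
    """Perception-action loop: pick a tool, issue a textual query, and feed
    the tool's structured output back into the model's context."""
    context = [question]
    for _ in range(max_calls):
        tool_name, query = propose_call(context)
        if tool_name is None:               # the model decides it can answer
            break
        context.append(str(TOOLS[tool_name](query, image)))
    return "answer grounded in: " + " | ".join(context)

print(answer("Which cup is closest to the robot?", b""))
```

In the paper the proposal step is a learned controller fused with the VLM’s language encoder; stubbing it out here keeps the loop runnable while preserving the control flow.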

Results & Findings

| Benchmark | Prior SOTA | SpaceTools (DIRL) | Gain (points) |
| --- | --- | --- | --- |
| RoboSpatial‑Home | 68.4 % | 80.5 % | +12 |
| BLINK (spatial QA) | 71.2 % | 78.9 % | +7.7 |
| BOP‑ASK (pose QA) | 64.0 % | 73.5 % | +9.5 |
  • Tool usage efficiency: on average the model calls only 1.8 tools per query, compared to 3.4 in a naïve exhaustive approach.
  • Real‑world robot tests: 94 % success rate on a 7‑DOF pick‑and‑place task requiring sub‑centimeter alignment, outperforming a baseline VLM that relied on a single depth estimator (71 % success).
  • Ablation: removing the teaching phase drops performance by ~8 %, confirming the importance of expert demonstrations for multi‑tool coordination.

Practical Implications

  • Embodied AI & Robotics – Developers can plug SpaceTools into existing robot stacks to give agents metric‑level spatial awareness without hand‑crafting perception pipelines.
  • Modular AI services – The DIRL framework can be reused to teach VLMs to orchestrate any set of APIs (e.g., OCR, 3‑D reconstruction), opening doors for more flexible AI assistants.
  • Reduced engineering overhead – Instead of manually chaining depth → segmentation → pose models, the system learns the optimal order, saving time and computational budget.
  • Better UI for mixed‑reality – Applications that need precise object placement (AR furniture layout, remote tele‑operation) can leverage the model’s ability to ask for the exact tool it needs at runtime.

Limitations & Future Work

  • Tool dependency – Performance hinges on the quality of the underlying visual tools; noisy depth or pose estimators can still degrade results.
  • Scalability of tool set – While DIRL handles a handful of tools well, the search space grows quickly with dozens of modules, requiring smarter curriculum or hierarchical selection strategies.
  • Generalization to unseen domains – The benchmarks focus on indoor household scenes; extending to outdoor or industrial environments may need domain‑specific tool fine‑tuning.
  • Future directions suggested by the authors include:
    1. Hierarchical DIRL to manage larger tool libraries.
    2. Curriculum learning that adapts tool selection based on task difficulty.
    3. Tighter integration with low‑level robot controllers for closed‑loop manipulation.

Authors

  • Siyi Chen
  • Mikaela Angelina Uy
  • Chan Hee Song
  • Faisal Ladhak
  • Adithyavairavan Murali
  • Qing Qu
  • Stan Birchfield
  • Valts Blukis
  • Jonathan Tremblay

Paper Information

  • arXiv ID: 2512.04069v1
  • Categories: cs.CV, cs.RO
  • Published: December 3, 2025
  • PDF: https://arxiv.org/pdf/2512.04069v1