[Paper] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Published: February 20, 2026 (01:46 PM EST)
4 min read
Source: arXiv (2602.18424v1)

Overview

The paper introduces CapNav, a new benchmark that tests Vision‑Language Models (VLMs) on indoor navigation tasks conditioned on the physical capabilities of different agents (e.g., a wheeled vacuum, a quadruped robot, or a human). By coupling language understanding with realistic mobility constraints, the authors expose a blind spot in current VLMs: they excel at “where to go” but stumble when the route must respect size, locomotion mode, or interaction limits.

Key Contributions

  • Capability‑Conditioned Navigation benchmark: 45 real‑world indoor scenes, 473 navigation episodes, and 2,365 question‑answer pairs that encode five distinct agents with explicit size, locomotion, and interaction specs.
  • Comprehensive evaluation suite: 13 state‑of‑the‑art VLMs (including CLIP‑based, Flamingo, GPT‑4‑V, etc.) are tested on both navigation success metrics and QA accuracy.
  • Empirical insight: Demonstrates a steep performance drop as agent constraints tighten, highlighting specific failure modes (e.g., reasoning about stair‑climbing ability, doorway width).
  • Open‑source release: Dataset, evaluation scripts, and baseline implementations are publicly available, encouraging reproducible research and community extensions.
  • Analysis of spatial‑dimensional reasoning: Provides a taxonomy of obstacle types (height‑only, width‑only, dynamic) and shows which categories are hardest for current models.

Methodology

  1. Agent Specification: Each of the five agents (e.g., “sweeping robot”, “humanoid”, “quadruped”) is described by a JSON block listing dimensions (height, width, radius), locomotion mode (wheel, leg, biped), and interaction abilities (can open doors, can climb stairs).
  2. Scene Collection: 45 indoor environments (apartments, offices, labs) were captured with 360° RGB‑D panoramas and annotated with semantic maps (walls, doors, stairs, obstacles).
  3. Task Generation: For every scene, navigation queries such as “Take the robot to the kitchen and fetch the mug” are paired with a capability‑aware feasibility check (e.g., “Can the robot fit through the hallway?”). This yields 473 navigation episodes.
  4. QA Pair Creation: Each episode is supplemented with 5‑6 natural‑language questions probing the model’s understanding of constraints (e.g., “Will the robot be able to cross the threshold?”).
  5. Evaluation Protocol:
    • Success Rate (SR) – did the model reach the target while respecting constraints?
    • Path Length Ratio (PLR) – efficiency compared to an oracle planner.
    • QA Accuracy – correctness of answers to the constraint‑focused questions.
    • All VLMs are prompted with the same multimodal input (scene images + textual query) and allowed to output a navigation plan or answer.
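A capability spec of the kind described in step 1, together with the step-3 feasibility check, could be sketched as follows. The JSON field names and the single-width comparison are illustrative assumptions; the paper does not publish its exact schema.

```python
import json

# Hypothetical agent spec in the JSON shape described in step 1;
# field names are assumptions, not the benchmark's actual schema.
SWEEPING_ROBOT = json.loads("""
{
  "name": "sweeping robot",
  "dimensions": {"height_m": 0.10, "width_m": 0.35, "radius_m": 0.175},
  "locomotion": "wheel",
  "interaction": {"can_open_doors": false, "can_climb_stairs": false}
}
""")

def passes_doorway(agent: dict, doorway_width_m: float) -> bool:
    """Step 3's feasibility check ('Can the robot fit through the
    hallway?'), reduced to a single width comparison for illustration."""
    return agent["dimensions"]["width_m"] <= doorway_width_m

print(passes_doorway(SWEEPING_ROBOT, doorway_width_m=0.60))  # True
print(passes_doorway(SWEEPING_ROBOT, doorway_width_m=0.30))  # False
```

A full feasibility check would also consult locomotion mode and interaction abilities (e.g., rejecting stair routes for a wheeled agent), but the structure is the same: compare route requirements against declared capabilities.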

Results & Findings

| Model | Success Rate (unconstrained) | Success Rate (tightest constraints) | QA Accuracy (overall) |
| --- | --- | --- | --- |
| CLIP‑ViT‑B/32 | 71 % | 32 % | 58 % |
| Flamingo‑3B | 78 % | 35 % | 62 % |
| GPT‑4‑V | 84 % | 41 % | 68 % |
| CapNav‑Fine‑Tuned (baseline) | 88 % | 55 % | 73 % |

  • Performance degrades sharply as the agent’s mobility envelope shrinks; models that excel on open‑floor navigation fall below 40 % success on stair‑climbing or narrow‑door scenarios.
  • Spatial‑dimensional reasoning is the bottleneck: errors cluster around obstacles that require evaluating both height and width (e.g., “Can the robot pass under the low table while going through a narrow doorway?”).
  • Fine‑tuning on CapNav data yields a ~15 % boost in constrained scenarios, suggesting that the benchmark can drive targeted improvements.
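The success-rate and path-length-ratio metrics from the evaluation protocol reduce to simple aggregates over episodes. A minimal sketch (the episode record fields, and the choice to average PLR over successful episodes only, are assumptions for illustration):

```python
def success_rate(episodes: list[dict]) -> float:
    """Fraction of episodes where the target was reached without
    violating any capability constraint (SR)."""
    wins = sum(1 for e in episodes if e["reached"] and not e["violated"])
    return wins / len(episodes)

def path_length_ratio(episodes: list[dict]) -> float:
    """Mean ratio of the model's path length to the oracle planner's
    (PLR); averaged over successful episodes here, which is an
    assumption rather than the paper's stated convention."""
    ok = [e for e in episodes if e["reached"] and not e["violated"]]
    return sum(e["path_len"] / e["oracle_len"] for e in ok) / len(ok)

episodes = [
    {"reached": True,  "violated": False, "path_len": 12.0, "oracle_len": 10.0},
    {"reached": True,  "violated": True,  "path_len": 11.0, "oracle_len": 10.0},
    {"reached": False, "violated": False, "path_len": 25.0, "oracle_len": 10.0},
]
print(round(success_rate(episodes), 3))  # 0.333
print(path_length_ratio(episodes))       # 1.2
```

Note that a constraint violation counts against SR even when the target is reached, which is what separates CapNav's scoring from plain goal-reaching benchmarks.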

Practical Implications

  • Robotics developers can use CapNav to sanity‑check their VLM‑based planners before deploying on heterogeneous fleets (cleaning bots, delivery drones, service robots).
  • Product designers gain a systematic way to verify that a new robot’s form factor aligns with expected indoor use‑cases, reducing costly field trials.
  • Human‑computer interaction: Voice‑controlled assistants that issue navigation commands (e.g., “Send the robot to the living room”) can now be equipped with a quick capability check, preventing impossible requests.
  • Simulation‑to‑real transfer: CapNav’s real‑world scenes expose VLMs to realistic visual noise and layout irregularities, encouraging more robust embeddings that survive the “sim‑gap”.
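The quick capability check described for voice-controlled assistants can be as small as matching a route's requirements against the agent's declared abilities before dispatching the command. All names below are illustrative, not from the paper:

```python
def request_is_feasible(agent_abilities: dict, route_requirements: list[str]) -> bool:
    """Reject a navigation request up front if the route needs an
    ability the agent's spec does not declare (illustrative sketch)."""
    return all(agent_abilities.get(req, False) for req in route_requirements)

vacuum = {"can_open_doors": False, "can_climb_stairs": False}
humanoid = {"can_open_doors": True, "can_climb_stairs": True}
route = ["can_climb_stairs"]  # e.g., the living room is up one flight

print(request_is_feasible(vacuum, route))    # False
print(request_is_feasible(humanoid, route))  # True
```

Such a gate lets an assistant answer "the vacuum can't reach the living room" instead of dispatching an impossible request.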

Limitations & Future Work

  • Static environments only – moving obstacles (people, pets) are not modeled, limiting assessment of dynamic reasoning.
  • Agent set is fixed; extending to custom robot geometries will require additional annotation pipelines.
  • The benchmark relies on pre‑computed semantic maps; end‑to‑end perception (simultaneous mapping + navigation) remains an open challenge.
  • Authors suggest future work on continual learning where a VLM updates its capability model as it encounters new hardware, and on multimodal planning that fuses language, proprioception, and tactile feedback.

Authors

  • Xia Su
  • Ruiqi Chen
  • Benlin Liu
  • Jingwei Ma
  • Zonglin Di
  • Ranjay Krishna
  • Jon Froehlich

Paper Information

  • arXiv ID: 2602.18424v1
  • Categories: cs.CV, cs.RO
  • Published: February 20, 2026
