[Paper] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Source: arXiv - 2602.18424v1
Overview
The paper introduces CapNav, a new benchmark that tests Vision‑Language Models (VLMs) on indoor navigation tasks conditioned on the physical capabilities of different agents (e.g., a wheeled vacuum, a quadruped robot, or a human). By coupling language understanding with realistic mobility constraints, the authors expose a blind spot in current VLMs: they excel at “where to go” but stumble when the route must respect size, locomotion mode, or interaction limits.
Key Contributions
- Capability‑Conditioned Navigation benchmark: 45 real‑world indoor scenes, 473 navigation episodes, and 2,365 question‑answer pairs that encode five distinct agents with explicit size, locomotion, and interaction specs.
- Comprehensive evaluation suite: 13 state‑of‑the‑art VLMs (including CLIP‑based, Flamingo, GPT‑4‑V, etc.) are tested on both navigation success metrics and QA accuracy.
- Empirical insight: Demonstrates a steep performance drop as agent constraints tighten, highlighting specific failure modes (e.g., reasoning about stair‑climbing ability, doorway width).
- Open‑source release: Dataset, evaluation scripts, and baseline implementations are publicly available, encouraging reproducible research and community extensions.
- Analysis of spatial‑dimensional reasoning: Provides a taxonomy of obstacle types (height‑only, width‑only, dynamic) and shows which categories are hardest for current models.
Methodology
- Agent Specification: Each of the five agents (e.g., “sweeping robot”, “humanoid”, “quadruped”) is described by a JSON block listing dimensions (height, width, radius), locomotion mode (wheeled, legged, bipedal), and interaction abilities (can open doors, can climb stairs).
- Scene Collection: 45 indoor environments (apartments, offices, labs) were captured with 360° RGB‑D panoramas and annotated with semantic maps (walls, doors, stairs, obstacles).
- Task Generation: For every scene, navigation queries such as “Take the robot to the kitchen and fetch the mug” are paired with a capability‑aware feasibility check (e.g., “Can the robot fit through the hallway?”). This yields 473 navigation episodes.
- QA Pair Creation: Each episode is supplemented with 5‑6 natural‑language questions probing the model’s understanding of constraints (e.g., “Will the robot be able to cross the threshold?”).
- Evaluation Protocol:
  - Success Rate (SR) – did the model reach the target while respecting constraints?
  - Path Length Ratio (PLR) – efficiency relative to an oracle planner.
  - QA Accuracy – correctness of answers to the constraint‑focused questions.
- All VLMs are prompted with the same multimodal input (scene images + textual query) and allowed to output a navigation plan or answer.
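The agent specification and capability‑aware feasibility check described above can be sketched as follows. This is a minimal illustration: the field names, obstacle types, and thresholds are assumptions for the example, not the benchmark's actual schema.

```python
import json

# Hypothetical agent spec in the style the paper describes;
# field names are illustrative assumptions, not CapNav's schema.
AGENT_SPEC = json.loads("""
{
  "name": "sweeping robot",
  "dimensions": {"height_m": 0.10, "width_m": 0.35},
  "locomotion": "wheel",
  "abilities": {"open_doors": false, "climb_stairs": false}
}
""")

def is_traversable(agent: dict, obstacle: dict) -> bool:
    """Capability-aware feasibility check for one obstacle.

    Obstacle dicts carry a 'type' plus the relevant clearance, e.g.
    {"type": "doorway", "width_m": 0.30} or {"type": "stairs"}.
    """
    dims = agent["dimensions"]
    if obstacle["type"] == "doorway":
        return dims["width_m"] <= obstacle["width_m"]
    if obstacle["type"] == "low_overhang":
        return dims["height_m"] <= obstacle["clearance_m"]
    if obstacle["type"] == "stairs":
        return agent["abilities"]["climb_stairs"]
    return True  # unknown obstacle types are treated as passable

# A route is feasible only if every obstacle along it is traversable.
route = [{"type": "doorway", "width_m": 0.30},
         {"type": "low_overhang", "clearance_m": 0.25}]
feasible = all(is_traversable(AGENT_SPEC, ob) for ob in route)
# Here the 0.35 m-wide robot cannot fit the 0.30 m doorway,
# so the route is infeasible despite the low overhang being passable.
```

The same check runs per agent per episode, which is how one navigation query fans out into capability‑conditioned variants.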
Results & Findings
| Model | Success Rate (unconstrained) | Success Rate (tightest constraints) | QA Accuracy (overall) |
|---|---|---|---|
| CLIP‑ViT‑B/32 | 71 % | 32 % | 58 % |
| Flamingo‑3B | 78 % | 35 % | 62 % |
| GPT‑4‑V | 84 % | 41 % | 68 % |
| CapNav‑Fine‑Tuned (baseline) | 88 % | 55 % | 73 % |
- Performance degrades sharply as the agent’s mobility envelope shrinks; models that excel on open‑floor navigation fall below 40 % success on stair‑climbing or narrow‑door scenarios.
- Spatial‑dimensional reasoning is the bottleneck: errors cluster around obstacles that require evaluating both height and width (e.g., “Can the robot pass under the low table while going through a narrow doorway?”).
- Fine‑tuning on CapNav data yields roughly a 15‑percentage‑point boost in constrained success rate, suggesting that the benchmark can drive targeted improvements.
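The two navigation metrics behind these tables can be computed from per‑episode outcomes as in this minimal sketch. The episode record layout is an assumption, and averaging PLR over successful episodes only is one common convention, not necessarily the paper's exact definition.

```python
def success_rate(episodes: list[dict]) -> float:
    """Fraction of episodes where the target was reached
    without violating any capability constraint."""
    ok = sum(1 for e in episodes if e["reached"] and not e["violated"])
    return ok / len(episodes)

def path_length_ratio(episodes: list[dict]) -> float:
    """Mean ratio of the model's path length to the oracle planner's,
    averaged over successful episodes (an assumed convention)."""
    succ = [e for e in episodes if e["reached"] and not e["violated"]]
    if not succ:
        return float("inf")
    return sum(e["path_len"] / e["oracle_len"] for e in succ) / len(succ)

# Illustrative episode records (not real benchmark data).
episodes = [
    {"reached": True,  "violated": False, "path_len": 12.0, "oracle_len": 10.0},
    {"reached": True,  "violated": True,  "path_len": 8.0,  "oracle_len": 8.0},
    {"reached": False, "violated": False, "path_len": 5.0,  "oracle_len": 9.0},
]
sr = success_rate(episodes)        # 1 of 3 episodes counts as success
plr = path_length_ratio(episodes)  # 12.0 / 10.0 = 1.2
```

Note that a constraint violation fails the episode even when the target is reached, which is what separates CapNav's SR from plain goal‑reaching accuracy.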
Practical Implications
- Robotics developers can use CapNav to sanity‑check their VLM‑based planners before deploying on heterogeneous fleets (cleaning bots, delivery drones, service robots).
- Product designers gain a systematic way to verify that a new robot’s form factor aligns with expected indoor use‑cases, reducing costly field trials.
- Human‑computer interaction: Voice‑controlled assistants that issue navigation commands (e.g., “Send the robot to the living room”) can now be equipped with a quick capability check, preventing impossible requests.
- Simulation‑to‑real transfer: CapNav’s real‑world scenes expose VLMs to realistic visual noise and layout irregularities, encouraging more robust embeddings that survive the sim‑to‑real gap.
Limitations & Future Work
- Static environments only – moving obstacles (people, pets) are not modeled, limiting assessment of dynamic reasoning.
- Agent set is fixed; extending to custom robot geometries will require additional annotation pipelines.
- The benchmark relies on pre‑computed semantic maps; end‑to‑end perception (simultaneous mapping + navigation) remains an open challenge.
- Authors suggest future work on continual learning where a VLM updates its capability model as it encounters new hardware, and on multimodal planning that fuses language, proprioception, and tactile feedback.
Authors
- Xia Su
- Ruiqi Chen
- Benlin Liu
- Jingwei Ma
- Zonglin Di
- Ranjay Krishna
- Jon Froehlich
Paper Information
- arXiv ID: 2602.18424v1
- Categories: cs.CV, cs.RO
- Published: February 20, 2026