[Paper] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Published: February 20, 2026 (01:46 PM EST)
4 min read
Source: arXiv (2602.18424v1)

Overview

The paper introduces CapNav, a new benchmark that tests Vision‑Language Models (VLMs) on indoor navigation tasks conditioned on the physical capabilities of different agents (e.g., a wheeled vacuum, a quadruped robot, or a human). By coupling language understanding with realistic mobility constraints, the authors expose a blind spot in current VLMs: they excel at “where to go” but stumble when the route must respect size, locomotion mode, or interaction limits.

Key Contributions

  • Capability‑Conditioned Navigation benchmark: 45 real‑world indoor scenes, 473 navigation episodes, and 2,365 question‑answer pairs that encode five distinct agents with explicit size, locomotion, and interaction specs.
  • Comprehensive evaluation suite: 13 state‑of‑the‑art VLMs (including CLIP‑based, Flamingo, GPT‑4‑V, etc.) are tested on both navigation success metrics and QA accuracy.
  • Empirical insight: Demonstrates a steep performance drop as agent constraints tighten, highlighting specific failure modes (e.g., reasoning about stair‑climbing ability, doorway width).
  • Open‑source release: Dataset, evaluation scripts, and baseline implementations are publicly available, encouraging reproducible research and community extensions.
  • Analysis of spatial‑dimensional reasoning: Provides a taxonomy of obstacle types (height‑only, width‑only, dynamic) and shows which categories are hardest for current models.

Methodology

  1. Agent Specification: Each of the five agents (e.g., “sweeping robot”, “humanoid”, “quadruped”) is described by a JSON block listing dimensions (height, width, radius), locomotion mode (wheel, leg, biped), and interaction abilities (can open doors, can climb stairs).
  2. Scene Collection: 45 indoor environments (apartments, offices, labs) were captured with 360° RGB‑D panoramas and annotated with semantic maps (walls, doors, stairs, obstacles).
  3. Task Generation: For every scene, navigation queries such as “Take the robot to the kitchen and fetch the mug” are paired with a capability‑aware feasibility check (e.g., “Can the robot fit through the hallway?”). This yields 473 navigation episodes.
  4. QA Pair Creation: Each episode is supplemented with 5‑6 natural‑language questions probing the model’s understanding of constraints (e.g., “Will the robot be able to cross the threshold?”).
  5. Evaluation Protocol:
    • Success Rate (SR) – did the model reach the target while respecting constraints?
    • Path Length Ratio (PLR) – efficiency compared to an oracle planner.
    • QA Accuracy – correctness of answers to the constraint‑focused questions.
    • All VLMs are prompted with the same multimodal input (scene images + textual query) and allowed to output a navigation plan or answer.
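A capability spec of the kind described in step 1, together with the step-3 feasibility check, could be sketched as follows. The JSON field names and the single-width comparison are illustrative assumptions; the paper does not publish its exact schema.

```python
import json

# Hypothetical agent spec in the JSON shape described in step 1;
# field names are assumptions, not the benchmark's actual schema.
SWEEPING_ROBOT = json.loads("""
{
  "name": "sweeping robot",
  "dimensions": {"height_m": 0.10, "width_m": 0.35, "radius_m": 0.175},
  "locomotion": "wheel",
  "interaction": {"can_open_doors": false, "can_climb_stairs": false}
}
""")

def passes_doorway(agent: dict, doorway_width_m: float) -> bool:
    """Step 3's feasibility check ('Can the robot fit through the
    hallway?'), reduced to a single width comparison for illustration."""
    return agent["dimensions"]["width_m"] <= doorway_width_m

print(passes_doorway(SWEEPING_ROBOT, doorway_width_m=0.60))  # True
print(passes_doorway(SWEEPING_ROBOT, doorway_width_m=0.30))  # False
```

A full feasibility check would also consult locomotion mode and interaction abilities (e.g., rejecting stair routes for a wheeled agent), but the structure is the same: compare route requirements against declared capabilities.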

Results & Findings

| Model | Success Rate (unconstrained) | Success Rate (tightest constraints) | QA Accuracy (overall) |
| --- | --- | --- | --- |
| CLIP‑ViT‑B/32 | 71 % | 32 % | 58 % |
| Flamingo‑3B | 78 % | 35 % | 62 % |
| GPT‑4‑V | 84 % | 41 % | 68 % |
| CapNav‑Fine‑Tuned (baseline) | 88 % | 55 % | 73 % |

  • Performance degrades sharply as the agent’s mobility envelope shrinks; models that excel on open‑floor navigation fall below 40 % success on stair‑climbing or narrow‑door scenarios.
  • Spatial‑dimensional reasoning is the bottleneck: errors cluster around obstacles that require evaluating both height and width (e.g., “Can the robot pass under the low table while going through a narrow doorway?”).
  • Fine‑tuning on CapNav data yields a ~15 % boost in constrained scenarios, suggesting that the benchmark can drive targeted improvements.
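The success-rate and path-length-ratio metrics from the evaluation protocol reduce to simple aggregates over episodes. A minimal sketch (the episode record fields, and the choice to average PLR over successful episodes only, are assumptions for illustration):

```python
def success_rate(episodes: list[dict]) -> float:
    """Fraction of episodes where the target was reached without
    violating any capability constraint (SR)."""
    wins = sum(1 for e in episodes if e["reached"] and not e["violated"])
    return wins / len(episodes)

def path_length_ratio(episodes: list[dict]) -> float:
    """Mean ratio of the model's path length to the oracle planner's
    (PLR); averaged over successful episodes here, which is an
    assumption rather than the paper's stated convention."""
    ok = [e for e in episodes if e["reached"] and not e["violated"]]
    return sum(e["path_len"] / e["oracle_len"] for e in ok) / len(ok)

episodes = [
    {"reached": True,  "violated": False, "path_len": 12.0, "oracle_len": 10.0},
    {"reached": True,  "violated": True,  "path_len": 11.0, "oracle_len": 10.0},
    {"reached": False, "violated": False, "path_len": 25.0, "oracle_len": 10.0},
]
print(round(success_rate(episodes), 3))  # 0.333
print(path_length_ratio(episodes))       # 1.2
```

Note that a constraint violation counts against SR even when the target is reached, which is what separates CapNav's scoring from plain goal-reaching benchmarks.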

Practical Implications

  • Robotics developers can use CapNav to sanity‑check their VLM‑based planners before deploying on heterogeneous fleets (cleaning bots, delivery drones, service robots).
  • Product designers gain a systematic way to verify that a new robot’s form factor aligns with expected indoor use‑cases, reducing costly field trials.
  • Human‑computer interaction: Voice‑controlled assistants that issue navigation commands (e.g., “Send the robot to the living room”) can now be equipped with a quick capability check, preventing impossible requests.
  • Simulation‑to‑real transfer: CapNav’s real‑world scenes expose VLMs to realistic visual noise and layout irregularities, encouraging more robust embeddings that survive the “sim‑gap”.
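The quick capability check described for voice-controlled assistants can be as small as matching a route's requirements against the agent's declared abilities before dispatching the command. All names below are illustrative, not from the paper:

```python
def request_is_feasible(agent_abilities: dict, route_requirements: list[str]) -> bool:
    """Reject a navigation request up front if the route needs an
    ability the agent's spec does not declare (illustrative sketch)."""
    return all(agent_abilities.get(req, False) for req in route_requirements)

vacuum = {"can_open_doors": False, "can_climb_stairs": False}
humanoid = {"can_open_doors": True, "can_climb_stairs": True}
route = ["can_climb_stairs"]  # e.g., the living room is up one flight

print(request_is_feasible(vacuum, route))    # False
print(request_is_feasible(humanoid, route))  # True
```

Such a gate lets an assistant answer "the vacuum can't reach the living room" instead of dispatching an impossible request.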

Limitations & Future Work

  • Static environments only – moving obstacles (people, pets) are not modeled, limiting assessment of dynamic reasoning.
  • Agent set is fixed; extending to custom robot geometries will require additional annotation pipelines.
  • The benchmark relies on pre‑computed semantic maps; end‑to‑end perception (simultaneous mapping + navigation) remains an open challenge.
  • Authors suggest future work on continual learning where a VLM updates its capability model as it encounters new hardware, and on multimodal planning that fuses language, proprioception, and tactile feedback.

Authors

  • Xia Su
  • Ruiqi Chen
  • Benlin Liu
  • Jingwei Ma
  • Zonglin Di
  • Ranjay Krishna
  • Jon Froehlich

Paper Information

  • arXiv ID: 2602.18424v1
  • Categories: cs.CV, cs.RO
  • Published: February 20, 2026
