[Paper] BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Source: arXiv - 2603.09961v1
Overview
The paper introduces BEACON, a system that lets a robot understand open‑ended, relational language commands (e.g., “go to the chair next to the table”) even when the target is hidden behind obstacles. By predicting a bird’s‑eye‑view (BEV) affordance heatmap from surround‑view RGB‑D sensors, BEACON can infer traversable spots in occluded regions—something traditional vision‑language models that operate only on visible pixels struggle with.
Key Contributions
- BEV affordance prediction: First method to map language‑conditioned navigation goals onto an ego‑centric top‑down heatmap that includes both visible and occluded space.
- Spatially‑aware VLM integration: Extends a pretrained vision‑language model with explicit spatial cues, allowing it to reason about “where” in addition to “what”.
- Depth‑driven BEV feature fusion: Combines depth‑derived top‑down geometry with VLM outputs, yielding a richer representation of the local scene.
- Occlusion‑focused benchmark: Builds a new Habitat‑based dataset that deliberately places target locations behind furniture or moving agents, exposing the limits of image‑space baselines.
- Significant performance boost: Achieves a 22.74 % absolute improvement in geodesic‑threshold accuracy over the previous state‑of‑the‑art on occluded targets.
Methodology
- Sensor setup – The robot captures four RGB‑D streams (front, left, right, back), giving a 360° surround view of its immediate surroundings.
- Vision‑Language backbone – A pretrained VLM (e.g., CLIP) processes the concatenated images together with the natural‑language instruction, producing a set of high‑level visual‑semantic embeddings.
- Spatial cue injection – Positional encodings that describe each camera’s orientation and field‑of‑view are added to the VLM’s token stream, teaching the model to associate language with specific directions.
- Depth‑to‑BEV conversion – Using the depth channel, each RGB‑D frame is lifted into a local 2‑D occupancy grid (a top‑down “floor plan”) that marks free space, obstacles, and unknown (potentially occluded) cells.
- Fusion & heatmap generation – The VLM embeddings are merged with the BEV occupancy grid via a lightweight transformer decoder, which outputs a heatmap where higher values indicate higher confidence that the cell is a feasible navigation target given the instruction.
- Target selection – The robot selects the peak of the heatmap (or samples from high‑confidence regions) and plans a short‑range motion toward that location.
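The depth‑to‑BEV step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the pinhole intrinsics, grid resolution, camera height, and the `UNKNOWN`/`FREE`/`OCCUPIED` labels are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical parameters, chosen for illustration -- not from the paper.
FX, FY, CX, CY = 250.0, 250.0, 160.0, 120.0  # pinhole intrinsics
GRID_SIZE, CELL = 64, 0.1                    # 64x64 grid, 0.1 m cells, robot at center
UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def depth_to_bev(depth, cam_height=0.5, max_height=1.5):
    """Lift one depth image into a local top-down occupancy grid.

    Cells that receive no depth returns stay UNKNOWN -- exactly the
    "potentially occluded" space the affordance heatmap must reason about.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth                      # forward distance from the camera (m)
    x = (u - CX) * z / FX          # lateral offset, right positive (m)
    y = (v - CY) * z / FY          # vertical offset, down positive (m)
    valid = (z > 0.1) & (z < GRID_SIZE * CELL / 2)

    grid = np.full((GRID_SIZE, GRID_SIZE), UNKNOWN, dtype=np.uint8)
    gi = (x[valid] / CELL + GRID_SIZE / 2).astype(int)  # column: lateral
    gj = (z[valid] / CELL + GRID_SIZE / 2).astype(int)  # row: forward
    inside = (gi >= 0) & (gi < GRID_SIZE) & (gj >= 0) & (gj < GRID_SIZE)
    gi, gj = gi[inside], gj[inside]
    height = cam_height - y[valid][inside]  # height above the floor (m)
    # Returns near the floor mark free space; higher returns mark obstacles.
    obstacle = (height > 0.15) & (height < max_height)
    grid[gj[~obstacle], gi[~obstacle]] = FREE
    grid[gj[obstacle], gi[obstacle]] = OCCUPIED  # obstacles win ties
    return grid
```

In the full system, one such grid per camera would be rotated into a common robot frame and merged, and the result fused with the VLM embeddings rather than used directly.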
Results & Findings
- On the occlusion‑rich validation set, BEACON reaches 71.3 % success (geodesic error ≤ 0.5 m) versus 48.6 % for the best image‑space baseline—a 22.74 pp gain.
- Ablation studies show that removing spatial encodings drops performance by ~8 pp, while discarding depth‑derived BEV features reduces accuracy by ~12 pp, confirming both components are essential.
- Qualitative visualizations illustrate the heatmap correctly lighting up hidden spots behind a couch or a moving person, where pixel‑level models produce empty or noisy predictions.
- The system runs at ≈10 fps on a single RTX 3080, making it viable for real‑time robot control.
Practical Implications
- Home service robots can follow relational commands like “go to the table behind the sofa” without needing a perfect line of sight to the target.
- Warehouse automation benefits from robust goal inference when pallets or shelves block direct views, reducing the need for costly extra sensors.
- AR/VR assistants that operate on mobile devices can infer user‑intended interaction points even when parts of the scene are occluded, improving contextual overlays.
- The BEV‑centric representation aligns naturally with existing navigation stacks (e.g., ROS 2 nav2), allowing developers to plug BEACON’s heatmap directly into path planners.
- Open‑source code and dataset (linked on the project page) give the community a baseline for extending language‑conditioned navigation to other modalities (LiDAR, semantic maps) or larger environments.
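As a sketch of how a planner might consume the heatmap: take the highest‑confidence cell, convert it to metric coordinates in the robot frame, and hand it to the local planner as a goal. This assumes a hypothetical robot‑centered 64×64 grid with 0.1 m cells; the name `heatmap_to_goal` and the confidence threshold are illustrative, not from the BEACON release.

```python
import numpy as np

GRID_SIZE, CELL = 64, 0.1  # hypothetical grid: 64x64 cells, 0.1 m each

def heatmap_to_goal(heatmap, min_conf=0.5):
    """Pick the highest-confidence cell and return it as a metric (x, z)
    goal in the robot frame, or None if nothing is confident enough."""
    j, i = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if heatmap[j, i] < min_conf:
        return None                    # fall back, e.g. ask for clarification
    x = (i - GRID_SIZE / 2) * CELL     # meters to the robot's right
    z = (j - GRID_SIZE / 2) * CELL     # meters ahead of the robot
    return x, z
```

In a ROS 2 setup, the returned `(x, z)` pair could be wrapped in a `geometry_msgs/PoseStamped` and sent to nav2 as a navigation goal.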
Limitations & Future Work
- Local scope – BEACON only predicts affordances within a bounded radius (≈3 m). Extending to larger, multi‑room spaces will require hierarchical mapping or memory mechanisms.
- Static depth assumption – The depth‑to‑BEV conversion treats the scene as static during inference; rapidly moving obstacles could introduce errors.
- Reliance on pretrained VLMs – Performance is tied to the quality of the underlying vision‑language model; domain‑specific vocabularies may still need fine‑tuning.
- Real‑world transfer – The current evaluation is in simulation (Habitat). Bridging the sim‑to‑real gap (sensor noise, lighting variations) is an open challenge the authors plan to address with real‑robot experiments and domain‑adaptation techniques.
Authors
- Xinyu Gao
- Gang Chen
- Javier Alonso-Mora
Paper Information
- arXiv ID: 2603.09961v1
- Categories: cs.RO, cs.AI, cs.CV
- Published: March 10, 2026