[Paper] BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Source: arXiv - 2603.09961v1
Overview
The paper introduces BEACON, a system that lets a robot understand open‑ended, relational language commands (e.g., “go to the chair next to the table”) even when the target is hidden behind obstacles. By predicting a bird’s‑eye‑view (BEV) affordance heatmap from surround‑view RGB‑D sensors, BEACON can infer traversable spots in occluded regions—something traditional vision‑language models that operate only on visible pixels struggle with.
Key Contributions
- BEV affordance prediction: First method to map language‑conditioned navigation goals onto an ego‑centric top‑down heatmap that includes both visible and occluded space.
- Spatially‑aware VLM integration: Extends a pretrained vision‑language model with explicit spatial cues, allowing it to reason about “where” in addition to “what”.
- Depth‑driven BEV feature fusion: Combines depth‑derived top‑down geometry with VLM outputs, yielding a richer representation of the local scene.
- Occlusion‑focused benchmark: Builds a new Habitat‑based dataset that deliberately places target locations behind furniture or moving agents, exposing the limits of image‑space baselines.
- Significant performance boost: Achieves a 22.74 % absolute improvement in geodesic‑threshold accuracy over the previous state‑of‑the‑art on occluded targets.
Methodology
- Sensor setup – The robot captures four RGB‑D streams (front, left, right, back), giving a 360° surround view of its immediate surroundings.
- Vision‑Language backbone – A pretrained VLM (e.g., CLIP) processes the concatenated images together with the natural‑language instruction, producing a set of high‑level visual‑semantic embeddings.
- Spatial cue injection – Positional encodings that describe each camera’s orientation and field‑of‑view are added to the VLM’s token stream, teaching the model to associate language with specific directions.
- Depth‑to‑BEV conversion – Using the depth channel, each RGB‑D frame is lifted into a local 2‑D occupancy grid (a top‑down “floor plan”) that marks free space, obstacles, and unknown (potentially occluded) cells.
- Fusion & heatmap generation – The VLM embeddings are merged with the BEV occupancy grid via a lightweight transformer decoder, which outputs a heatmap where higher values indicate higher confidence that the cell is a feasible navigation target given the instruction.
- Target selection – The robot selects the peak of the heatmap (or samples from high‑confidence regions) and plans a short‑range motion toward that location.
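The depth‑to‑BEV step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the pinhole intrinsics, grid resolution, camera height, and the `UNKNOWN`/`FREE`/`OCCUPIED` labels are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical parameters, chosen for illustration -- not from the paper.
FX, FY, CX, CY = 250.0, 250.0, 160.0, 120.0  # pinhole intrinsics
GRID_SIZE, CELL = 64, 0.1                    # 64x64 grid, 0.1 m cells, robot at center
UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def depth_to_bev(depth, cam_height=0.5, max_height=1.5):
    """Lift one depth image into a local top-down occupancy grid.

    Cells that receive no depth returns stay UNKNOWN -- exactly the
    "potentially occluded" space the affordance heatmap must reason about.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth                      # forward distance from the camera (m)
    x = (u - CX) * z / FX          # lateral offset, right positive (m)
    y = (v - CY) * z / FY          # vertical offset, down positive (m)
    valid = (z > 0.1) & (z < GRID_SIZE * CELL / 2)

    grid = np.full((GRID_SIZE, GRID_SIZE), UNKNOWN, dtype=np.uint8)
    gi = (x[valid] / CELL + GRID_SIZE / 2).astype(int)  # column: lateral
    gj = (z[valid] / CELL + GRID_SIZE / 2).astype(int)  # row: forward
    inside = (gi >= 0) & (gi < GRID_SIZE) & (gj >= 0) & (gj < GRID_SIZE)
    gi, gj = gi[inside], gj[inside]
    height = cam_height - y[valid][inside]  # height above the floor (m)
    # Returns near the floor mark free space; higher returns mark obstacles.
    obstacle = (height > 0.15) & (height < max_height)
    grid[gj[~obstacle], gi[~obstacle]] = FREE
    grid[gj[obstacle], gi[obstacle]] = OCCUPIED  # obstacles win ties
    return grid
```

In the full system, one such grid per camera would be rotated into a common robot frame and merged, and the result fused with the VLM embeddings rather than used directly.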
Results & Findings
- On the occlusion‑rich validation set, BEACON reaches 71.3 % success (geodesic error ≤ 0.5 m) versus 48.6 % for the best image‑space baseline—a 22.74 pp gain.
- Ablation studies show that removing spatial encodings drops performance by ~8 pp, while discarding depth‑derived BEV features reduces accuracy by ~12 pp, confirming both components are essential.
- Qualitative visualizations illustrate the heatmap correctly lighting up hidden spots behind a couch or a moving person, where pixel‑level models produce empty or noisy predictions.
- The system runs at ≈10 fps on a single RTX 3080, making it viable for real‑time robot control.
Practical Implications
- Home service robots can follow relational commands like “go to the table behind the sofa” without needing a perfect line of sight to the target.
- Warehouse automation benefits from robust goal inference when pallets or shelves block direct views, reducing the need for costly extra sensors.
- AR/VR assistants that operate on mobile devices can infer user‑intended interaction points even when parts of the scene are occluded, improving contextual overlays.
- The BEV‑centric representation aligns naturally with existing navigation stacks (e.g., ROS 2 nav2), allowing developers to plug BEACON’s heatmap directly into path planners.
- Open‑source code and dataset (linked on the project page) give the community a baseline for extending language‑conditioned navigation to other modalities (LiDAR, semantic maps) or larger environments.
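As a sketch of how a planner might consume the heatmap: take the highest‑confidence cell, convert it to metric coordinates in the robot frame, and hand it to the local planner as a goal. This assumes a hypothetical robot‑centered 64×64 grid with 0.1 m cells; the name `heatmap_to_goal` and the confidence threshold are illustrative, not from the BEACON release.

```python
import numpy as np

GRID_SIZE, CELL = 64, 0.1  # hypothetical grid: 64x64 cells, 0.1 m each

def heatmap_to_goal(heatmap, min_conf=0.5):
    """Pick the highest-confidence cell and return it as a metric (x, z)
    goal in the robot frame, or None if nothing is confident enough."""
    j, i = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if heatmap[j, i] < min_conf:
        return None                    # fall back, e.g. ask for clarification
    x = (i - GRID_SIZE / 2) * CELL     # meters to the robot's right
    z = (j - GRID_SIZE / 2) * CELL     # meters ahead of the robot
    return x, z
```

In a ROS 2 setup, the returned `(x, z)` pair could be wrapped in a `geometry_msgs/PoseStamped` and sent to nav2 as a navigation goal.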
Limitations & Future Work
- Local scope – BEACON only predicts affordances within a bounded radius (≈3 m). Extending to larger, multi‑room spaces will require hierarchical mapping or memory mechanisms.
- Static depth assumption – The depth‑to‑BEV conversion treats the scene as static during inference; rapidly moving obstacles could introduce errors.
- Reliance on pretrained VLMs – Performance is tied to the quality of the underlying vision‑language model; domain‑specific vocabularies may still need fine‑tuning.
- Real‑world transfer – The current evaluation is in simulation (Habitat). Bridging the sim‑to‑real gap (sensor noise, lighting variations) is an open challenge the authors plan to address with real‑robot experiments and domain‑adaptation techniques.
Authors
- Xinyu Gao
- Gang Chen
- Javier Alonso-Mora
Paper Information
- arXiv ID: 2603.09961v1
- Categories: cs.RO, cs.AI, cs.CV
- Published: March 10, 2026