[Paper] LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating
Source: arXiv - 2512.09920v1
Overview
The paper introduces LISN‑Bench, the first simulation benchmark that evaluates mobile robots on language‑instructed social navigation. By combining natural‑language instruction following with classic collision‑avoidance, the authors push robot navigation toward real‑world human‑robot coexistence. Their proposed Social‑Nav‑Modulator architecture leverages a vision‑language model (VLM) to dynamically adjust costmaps and controller parameters, achieving a dramatic boost in success rates over existing baselines.
Key Contributions
- LISN‑Bench: a ROS‑based, open‑source benchmark built on Rosnav‑Arena 3.0 that integrates diverse language instructions, scene understanding, and social constraints.
- Social‑Nav‑Modulator: a hierarchical “fast‑slow” controller where a VLM runs at a low frequency to modulate the robot’s costmap and low‑level controller gains, decoupling heavy perception from real‑time actuation.
- Empirical breakthrough: the system reaches a 91.3 % average success rate, a roughly 63 % relative improvement over the strongest baseline (56.2 %), with the largest gains on tasks like "follow a person in a crowd" and "avoid forbidden zones".
- Public resources: code, benchmark scenarios, and pretrained models are released, enabling reproducible research and rapid prototyping.
Methodology
Benchmark design
- Built on the ROS‑compatible Rosnav‑Arena 3.0 simulator.
- Scenarios include static obstacles, moving pedestrians, and instruction‑forbidden regions (e.g., “do not cross the red carpet”).
- Each episode provides a natural‑language command and a goal pose.
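An episode combining a language command with a goal pose can be represented compactly. The field names below are illustrative stand-ins, not taken from the LISN‑Bench API:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One benchmark episode: a language command plus a goal pose.

    Field names are illustrative; the actual LISN-Bench schema may differ.
    """
    instruction: str                 # natural-language command
    goal_pose: tuple                 # (x, y, yaw) in the map frame
    forbidden_zones: list = field(default_factory=list)  # polygons to avoid

ep = Episode(
    instruction="Go to the kitchen, but do not cross the red carpet.",
    goal_pose=(4.2, -1.0, 1.57),
    forbidden_zones=[[(1, 0), (2, 0), (2, 1), (1, 1)]],
)
```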
Social‑Nav‑Modulator architecture
- Slow loop (VLM agent): Every ~1 s, a vision‑language model processes the RGB image, the current map, and the textual instruction. It outputs modulation signals: (i) adjustments to the costmap (e.g., raise cost in forbidden zones), and (ii) scaling factors for the low‑level controller (e.g., increase angular gain when a person is nearby).
- Fast loop (traditional controller): A standard DWA (Dynamic Window Approach) or TEB (Timed‑Elastic‑Band) planner runs at 10–20 Hz, consuming the modulated costmap and controller parameters to generate velocity commands.
- Decoupling advantage: Heavy VLM inference is amortized, keeping the robot’s control loop responsive while still benefiting from high‑level semantic reasoning.
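The fast-slow decoupling can be sketched in a few lines. The 1 s modulation period and 20 Hz control rate follow the paper's description; the function bodies are placeholder stand-ins for the real VLM and DWA/TEB planner:

```python
SLOW_PERIOD = 1.0   # VLM modulation interval (s), per the paper
FAST_RATE = 20      # control loop frequency (Hz)

def vlm_modulate(image, costmap, instruction):
    """Placeholder for the slow VLM call: returns costmap edits and gain scales."""
    return {"forbidden_penalty": 255}, {"angular_gain": 1.5}

def fast_controller(costmap, gains):
    """Placeholder for DWA/TEB: returns a (linear, angular) velocity command."""
    return (0.5, 0.1 * gains["angular_gain"])

def run(duration_s=2.0):
    costmap, gains = {}, {"angular_gain": 1.0}
    commands, next_slow = [], 0.0
    for step in range(int(duration_s * FAST_RATE)):
        t = step / FAST_RATE
        if t >= next_slow:  # slow loop: amortized VLM inference, every ~1 s
            edits, gains = vlm_modulate(None, costmap, "avoid the red carpet")
            costmap.update(edits)
            next_slow += SLOW_PERIOD
        # fast loop: runs every tick, always consuming the latest modulation
        commands.append(fast_controller(costmap, gains))
    return commands

cmds = run()
```

The fast loop never blocks on the VLM: it simply reads whatever modulation the slow loop last wrote, which is the decoupling advantage described above.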
Training &amp; inference
- The VLM is fine‑tuned on a synthetic dataset of paired images, instructions, and desired costmap modifications.
- No end‑to‑end RL; the system remains modular, allowing developers to swap out planners or VLM backbones.
Results & Findings
| Metric | Social‑Nav‑Modulator | Best Baseline (e.g., VLM‑Only) |
|---|---|---|
| Success Rate (overall) | 91.3 % | 56.2 % |
| Follow‑person in crowd | 88.7 % | 45.1 % |
| Forbidden‑zone avoidance | 94.2 % | 62.3 % |
| Average navigation time | 12.4 s | 15.8 s |
- Speed‑accuracy trade‑off: By running the VLM at a lower frequency, the system maintains real‑time responsiveness (≈20 Hz control loop) while still achieving higher success than a constantly‑running VLM.
- Robustness to dynamic crowds: The costmap modulation quickly raises penalties around moving pedestrians, enabling smoother detours without sacrificing instruction compliance.
- Ablation studies: Removing either costmap modulation or controller‑gain scaling drops performance by ~20 %, confirming the synergy of both signals.
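The costmap modulation behind these results can be illustrated on a toy grid: costs are raised within a radius of each pedestrian and inside instruction-forbidden cells. The penalty values and radius here are made up for illustration and are not from the paper:

```python
import numpy as np

def modulate_costmap(costmap, pedestrians, forbidden_mask,
                     ped_radius=3, ped_cost=200, zone_cost=254):
    """Raise costs near pedestrians and inside instruction-forbidden zones.

    costmap: 2-D uint8 array (0 = free, 255 = lethal); values are illustrative.
    """
    out = costmap.copy()
    h, w = out.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for (py, px) in pedestrians:
        near = (yy - py) ** 2 + (xx - px) ** 2 <= ped_radius ** 2
        out[near] = np.maximum(out[near], ped_cost)
    out[forbidden_mask] = np.maximum(out[forbidden_mask], zone_cost)
    return out

grid = np.zeros((10, 10), dtype=np.uint8)
forbidden = np.zeros_like(grid, dtype=bool)
forbidden[0:2, :] = True  # e.g. "do not cross the red carpet"
mod = modulate_costmap(grid, pedestrians=[(5, 5)], forbidden_mask=forbidden)
```

Because the planner in the fast loop reads this modulated map directly, detours around pedestrians and forbidden zones emerge without changing the planner itself.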
Practical Implications
- Plug‑and‑play navigation stack: Developers can integrate the Social‑Nav‑Modulator into existing ROS navigation pipelines with minimal changes—just replace the costmap server and expose a VLM inference node.
- Natural‑language interfaces: Service robots (e.g., delivery bots in offices or hospitals) can now obey high‑level commands like “bring the coffee to the meeting room, but stay away from the fire‑exit corridor,” improving user trust.
- Safety‑by‑instruction: Forbidden‑zone handling enables compliance with regulatory or site‑specific rules without hard‑coding static maps.
- Scalable perception: The hierarchical design reduces GPU load, making it feasible on edge devices (NVIDIA Jetson, Intel NCS2) for real‑world deployments.
- Benchmark as a development yardstick: LISN‑Bench offers a standardized testbed for evaluating future language‑guided navigation solutions, encouraging reproducibility and fair comparison.
Limitations & Future Work
- Simulation‑only evaluation: Real‑world transfer is not demonstrated; domain gaps (lighting, sensor noise) could affect VLM perception.
- Instruction complexity: Benchmarks focus on single‑sentence commands; handling multi‑step or ambiguous instructions remains open.
- VLM latency: Although amortized, the VLM still introduces a ~1 s delay, which may be problematic in highly dynamic environments.
- Scalability of fine‑tuning: The current VLM fine‑tuning relies on synthetic data; scaling to diverse indoor/outdoor domains may require larger, annotated corpora.
Future research directions include real‑robot experiments, hierarchical language planners for multi‑step tasks, and adaptive scheduling of VLM inference based on environmental dynamics.
Authors
- Junting Chen
- Yunchuan Li
- Panfeng Jiang
- Jiacheng Du
- Zixuan Chen
- Chenrui Tie
- Jiajun Deng
- Lin Shao
Paper Information
- arXiv ID: 2512.09920v1
- Categories: cs.RO, cs.AI, cs.CV
- Published: December 10, 2025