[Paper] LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating
Source: arXiv - 2512.09920v1
Overview
The paper introduces LISN‑Bench, the first simulation benchmark that evaluates mobile robots on language‑instructed social navigation. By combining natural‑language instruction following with classic collision‑avoidance, the authors push robot navigation toward real‑world human‑robot coexistence. Their proposed Social‑Nav‑Modulator architecture leverages a vision‑language model (VLM) to dynamically adjust costmaps and controller parameters, achieving a dramatic boost in success rates over existing baselines.
Key Contributions
- LISN‑Bench: a ROS‑based, open‑source benchmark built on Rosnav‑Arena 3.0 that integrates diverse language instructions, scene understanding, and social constraints.
- Social‑Nav‑Modulator: a hierarchical “fast‑slow” controller where a VLM runs at a low frequency to modulate the robot’s costmap and low‑level controller gains, decoupling heavy perception from real‑time actuation.
- Empirical breakthrough: the system reaches a 91.3 % average success rate, a roughly 63 % relative improvement over the strongest baseline (56.2 %), with the largest gains on tasks like "follow a person in a crowd" and "avoid forbidden zones".
- Public resources: code, benchmark scenarios, and pretrained models are released, enabling reproducible research and rapid prototyping.
Methodology
Benchmark design
- Built on the ROS‑compatible Rosnav‑Arena 3.0 simulator.
- Scenarios include static obstacles, moving pedestrians, and instruction‑forbidden regions (e.g., “do not cross the red carpet”).
- Each episode provides a natural‑language command and a goal pose.
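An episode combining a language command with a goal pose can be represented compactly. The field names below are illustrative stand-ins, not taken from the LISN‑Bench API:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One benchmark episode: a language command plus a goal pose.

    Field names are illustrative; the actual LISN-Bench schema may differ.
    """
    instruction: str                 # natural-language command
    goal_pose: tuple                 # (x, y, yaw) in the map frame
    forbidden_zones: list = field(default_factory=list)  # polygons to avoid

ep = Episode(
    instruction="Go to the kitchen, but do not cross the red carpet.",
    goal_pose=(4.2, -1.0, 1.57),
    forbidden_zones=[[(1, 0), (2, 0), (2, 1), (1, 1)]],
)
```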
Social‑Nav‑Modulator architecture
- Slow loop (VLM agent): Every ~1 s, a vision‑language model processes the RGB image, the current map, and the textual instruction. It outputs modulation signals: (i) adjustments to the costmap (e.g., raise cost in forbidden zones), and (ii) scaling factors for the low‑level controller (e.g., increase angular gain when a person is nearby).
- Fast loop (traditional controller): A standard DWA (Dynamic Window Approach) or TEB (Timed‑Elastic‑Band) planner runs at 10–20 Hz, consuming the modulated costmap and controller parameters to generate velocity commands.
- Decoupling advantage: Heavy VLM inference is amortized, keeping the robot’s control loop responsive while still benefiting from high‑level semantic reasoning.
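The fast-slow decoupling can be sketched in a few lines. The 1 s modulation period and 20 Hz control rate follow the paper's description; the function bodies are placeholder stand-ins for the real VLM and DWA/TEB planner:

```python
SLOW_PERIOD = 1.0   # VLM modulation interval (s), per the paper
FAST_RATE = 20      # control loop frequency (Hz)

def vlm_modulate(image, costmap, instruction):
    """Placeholder for the slow VLM call: returns costmap edits and gain scales."""
    return {"forbidden_penalty": 255}, {"angular_gain": 1.5}

def fast_controller(costmap, gains):
    """Placeholder for DWA/TEB: returns a (linear, angular) velocity command."""
    return (0.5, 0.1 * gains["angular_gain"])

def run(duration_s=2.0):
    costmap, gains = {}, {"angular_gain": 1.0}
    commands, next_slow = [], 0.0
    for step in range(int(duration_s * FAST_RATE)):
        t = step / FAST_RATE
        if t >= next_slow:  # slow loop: amortized VLM inference, every ~1 s
            edits, gains = vlm_modulate(None, costmap, "avoid the red carpet")
            costmap.update(edits)
            next_slow += SLOW_PERIOD
        # fast loop: runs every tick, always consuming the latest modulation
        commands.append(fast_controller(costmap, gains))
    return commands

cmds = run()
```

The fast loop never blocks on the VLM: it simply reads whatever modulation the slow loop last wrote, which is the decoupling advantage described above.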
Training &amp; inference
- The VLM is fine‑tuned on a synthetic dataset of paired images, instructions, and desired costmap modifications.
- No end‑to‑end RL; the system remains modular, allowing developers to swap out planners or VLM backbones.
Results & Findings
| Metric | Social‑Nav‑Modulator | Best Baseline (e.g., VLM‑Only) |
|---|---|---|
| Success Rate (overall) | 91.3 % | 56.2 % |
| Follow‑person in crowd | 88.7 % | 45.1 % |
| Forbidden‑zone avoidance | 94.2 % | 62.3 % |
| Average navigation time | 12.4 s | 15.8 s |
- Speed‑accuracy trade‑off: By running the VLM at a lower frequency, the system maintains real‑time responsiveness (≈20 Hz control loop) while still achieving higher success than a constantly‑running VLM.
- Robustness to dynamic crowds: The costmap modulation quickly raises penalties around moving pedestrians, enabling smoother detours without sacrificing instruction compliance.
- Ablation studies: Removing either costmap modulation or controller‑gain scaling drops performance by ~20 %, confirming the synergy of both signals.
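The costmap modulation behind these results can be illustrated on a toy grid: costs are raised within a radius of each pedestrian and inside instruction-forbidden cells. The penalty values and radius here are made up for illustration and are not from the paper:

```python
import numpy as np

def modulate_costmap(costmap, pedestrians, forbidden_mask,
                     ped_radius=3, ped_cost=200, zone_cost=254):
    """Raise costs near pedestrians and inside instruction-forbidden zones.

    costmap: 2-D uint8 array (0 = free, 255 = lethal); values are illustrative.
    """
    out = costmap.copy()
    h, w = out.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for (py, px) in pedestrians:
        near = (yy - py) ** 2 + (xx - px) ** 2 <= ped_radius ** 2
        out[near] = np.maximum(out[near], ped_cost)
    out[forbidden_mask] = np.maximum(out[forbidden_mask], zone_cost)
    return out

grid = np.zeros((10, 10), dtype=np.uint8)
forbidden = np.zeros_like(grid, dtype=bool)
forbidden[0:2, :] = True  # e.g. "do not cross the red carpet"
mod = modulate_costmap(grid, pedestrians=[(5, 5)], forbidden_mask=forbidden)
```

Because the planner in the fast loop reads this modulated map directly, detours around pedestrians and forbidden zones emerge without changing the planner itself.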
Practical Implications
- Plug‑and‑play navigation stack: Developers can integrate the Social‑Nav‑Modulator into existing ROS navigation pipelines with minimal changes—just replace the costmap server and expose a VLM inference node.
- Natural‑language interfaces: Service robots (e.g., delivery bots in offices or hospitals) can now obey high‑level commands like “bring the coffee to the meeting room, but stay away from the fire‑exit corridor,” improving user trust.
- Safety‑by‑instruction: Forbidden‑zone handling enables compliance with regulatory or site‑specific rules without hard‑coding static maps.
- Scalable perception: The hierarchical design reduces GPU load, making it feasible on edge devices (NVIDIA Jetson, Intel NCS2) for real‑world deployments.
- Benchmark as a development yardstick: LISN‑Bench offers a standardized testbed for evaluating future language‑guided navigation solutions, encouraging reproducibility and fair comparison.
Limitations & Future Work
- Simulation‑only evaluation: Real‑world transfer is not demonstrated; domain gaps (lighting, sensor noise) could affect VLM perception.
- Instruction complexity: Benchmarks focus on single‑sentence commands; handling multi‑step or ambiguous instructions remains open.
- VLM latency: Although amortized, the VLM still introduces a ~1 s delay, which may be problematic in highly dynamic environments.
- Scalability of fine‑tuning: The current VLM fine‑tuning relies on synthetic data; scaling to diverse indoor/outdoor domains may require larger, annotated corpora.
Future research directions include real‑robot experiments, hierarchical language planners for multi‑step tasks, and adaptive scheduling of VLM inference based on environmental dynamics.
Authors
- Junting Chen
- Yunchuan Li
- Panfeng Jiang
- Jiacheng Du
- Zixuan Chen
- Chenrui Tie
- Jiajun Deng
- Lin Shao
Paper Information
- arXiv ID: 2512.09920v1
- Categories: cs.RO, cs.AI, cs.CV
- Published: December 10, 2025