[Paper] From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection
Source: arXiv - 2602.09002v1
Overview
The paper introduces a new framework for social robot navigation that goes beyond simple obstacle avoidance. By coupling traditional geometric path planning with a vision‑language model (VLM) that understands human social cues, the system can pick routes that respect personal space, activity flow, and unwritten etiquette—making robots behave more like considerate pedestrians in real‑world settings.
Key Contributions
- Hybrid planning pipeline that first generates geometry‑based candidate trajectories and then ranks them using a task‑specific VLM.
- Fine‑tuned VLM that distills social reasoning from large foundation models into a lightweight network capable of real‑time inference on robot hardware.
- Context‑aware scoring function that evaluates paths against social expectations (e.g., “don’t cut in front of a walking group”, “avoid standing in front of a desk”).
- Comprehensive evaluation across four distinct social navigation scenarios, showing superior performance in personal‑space preservation and social zone compliance.
- Open‑source project page with code, videos, and dataset links for reproducibility and community extension.
Methodology
- Obstacle & Dynamics Extraction – Using onboard perception (LiDAR, RGB‑D), the robot builds a short‑term map of static obstacles and tracks nearby humans’ positions and velocities.
- Geometric Candidate Generation – A conventional planner (e.g., RRT* or lattice‑based) proposes a set of collision‑free trajectories that satisfy kinematic constraints.
- VLM‑Based Social Scoring – Each candidate is rendered into a short visual snippet (bird's‑eye view + human pose overlays). The snippet, together with a textual prompt describing the current context (“navigate through a hallway while people are chatting”), is fed to a fine‑tuned vision‑language model. The VLM outputs a social suitability score reflecting how well the path aligns with common‑sense etiquette.
- Path Selection & Control – The robot picks the highest‑scoring trajectory and passes it to a low‑level controller for execution. If the environment changes, the loop repeats at ~10 Hz, enabling real‑time adaptation.
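The four steps above can be condensed into one planning cycle. The sketch below is a minimal, hypothetical rendering of that loop: `Candidate`, `vlm_social_score`, and `select_path` are illustrative names, and the scorer is a stand-in heuristic (rewarding clearance from tracked humans) rather than the paper's actual fine-tuned VLM.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    waypoints: list          # [(x, y), ...] collision-free path from the geometric planner
    clearance: float         # min distance to tracked humans along the path (m)

def vlm_social_score(candidate, context_prompt):
    """Stand-in for the fine-tuned VLM: reward wider berths around people.
    The real model consumes a rendered bird's-eye snippet plus the text prompt."""
    return 1.0 - math.exp(-candidate.clearance)

def select_path(candidates, context_prompt):
    """Pick the highest-scoring collision-free trajectory for execution."""
    return max(candidates, key=lambda c: vlm_social_score(c, context_prompt))

# One planning cycle (the paper repeats this at ~10 Hz):
candidates = [
    Candidate(waypoints=[(0, 0), (1, 0)], clearance=0.3),          # grazes a group
    Candidate(waypoints=[(0, 0), (0.5, 1), (1, 1)], clearance=1.2),  # wider berth
]
best = select_path(candidates, "navigate a hallway while people are chatting")
print(best.clearance)  # → 1.2
```

The key design point is that the geometric planner guarantees collision-freedom and kinematic feasibility, so the VLM only has to rank already-safe options.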
The VLM is trained on a curated dataset of human‑annotated “good” vs. “bad” navigation examples, allowing it to capture nuanced rules (e.g., “don’t block a person’s line of sight to a screen”) without hand‑coding each rule.
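One common way to train a scorer from "good" vs. "bad" annotated pairs is a pairwise preference objective; the snippet below sketches that idea. Note this Bradley-Terry-style logistic loss is an assumption for illustration, not the paper's stated training objective.

```python
import math

def pairwise_loss(score_good, score_bad):
    """Logistic preference loss: small when the annotated 'good' path
    scores higher than the 'bad' one, large when the ordering is wrong."""
    return math.log(1.0 + math.exp(-(score_good - score_bad)))

# Before training, the model may rank a rude shortcut above a polite detour;
# minimizing the loss pushes the ordering the annotators chose.
loss_wrong_order   = pairwise_loss(score_good=0.2, score_bad=0.8)
loss_correct_order = pairwise_loss(score_good=0.9, score_bad=0.1)
print(loss_wrong_order > loss_correct_order)  # → True
```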
Results & Findings
| Scenario | Personal‑Space Violation (s) | Pedestrian‑Facing Time (s) | Social‑Zone Intrusions |
|---|---|---|---|
| Corridor with crossing flow | 0.12 (best) | 0.35 (best) | 0 |
| Office hallway (people at desks) | 0.08 (best) | 0.22 (best) | 0 |
| Café (tables & standing groups) | 0.15 (best) | 0.40 (best) | 0 |
| Museum exhibit (dense crowd) | 0.10 (best) | 0.30 (best) | 0 |
- The VLM‑informed selector consistently outperformed pure geometric planners and rule‑based social planners on all metrics.
- Latency remained under 100 ms per planning cycle, confirming suitability for on‑board deployment.
- Qualitative video demos show the robot naturally yielding, taking wider turns, and even “waiting politely” when a human pauses in its path.
Practical Implications
- Robotics SDKs can integrate the VLM scoring module as a plug‑in, upgrading existing navigation stacks with social awareness without redesigning the whole planner.
- Warehouse & delivery robots can reduce accidental interruptions of human workers, potentially lowering safety incidents and improving coworker acceptance.
- Service robots (e.g., in hotels or hospitals) gain a more “human‑like” presence, which can boost user comfort and trust.
- The lightweight VLM runs on edge GPUs (e.g., NVIDIA Jetson), meaning developers don’t need cloud inference, preserving privacy and latency.
- The approach is modular: developers can swap the geometric planner, adjust the prompt language, or fine‑tune the VLM on domain‑specific etiquette (e.g., cultural norms for different regions).
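The modularity claim can be made concrete with a thin wrapper interface: the social scorer re-ranks whatever an existing planner produces, without touching the planner itself. Everything below (`SocialScoringPlugin`, the toy planner and scorer) is a hypothetical sketch of such a plug-in, not an API from the paper's released code.

```python
from typing import Callable, List, Sequence, Tuple

Path = Sequence[Tuple[float, float]]

class SocialScoringPlugin:
    """Hypothetical plug-in: wraps any candidate-generating planner and
    re-ranks its output with a swappable social-suitability scorer."""

    def __init__(self,
                 planner: Callable[[], List[Path]],
                 scorer: Callable[[Path, str], float]):
        self.planner = planner
        self.scorer = scorer

    def plan(self, context_prompt: str) -> Path:
        candidates = self.planner()
        return max(candidates, key=lambda p: self.scorer(p, context_prompt))

# Toy components: the "planner" returns two fixed paths, and the "scorer"
# prefers longer detours as a crude proxy for giving people a wider berth.
toy_planner = lambda: [[(0, 0), (1, 0)], [(0, 0), (0, 1), (1, 1)]]
toy_scorer = lambda path, prompt: float(len(path))

nav = SocialScoringPlugin(toy_planner, toy_scorer)
chosen = nav.plan("pass through the office hallway")
print(len(chosen))  # → 3
```

Because both the planner and the scorer are injected callables, either can be swapped: a different geometric planner, a re-prompted VLM, or a scorer fine-tuned on region-specific etiquette.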
Limitations & Future Work
- The current VLM is trained on a limited set of indoor scenarios; outdoor or highly dynamic environments (e.g., crowded streets) may require additional data.
- Social reasoning is implicit in the model; debugging a specific undesirable behavior can be non‑trivial compared to explicit rule‑based systems.
- The framework assumes reliable human tracking; occlusions or sensor failures could degrade the VLM’s scoring accuracy.
- Future research directions include: extending the dataset to cover cross‑cultural etiquette, incorporating multimodal cues (audio, gaze), and exploring continual learning so robots can adapt to new social norms on the fly.
Authors
- Zilin Fang
- Anxing Xiao
- David Hsu
- Gim Hee Lee
Paper Information
- arXiv ID: 2602.09002v1
- Categories: cs.RO, cs.AI
- Published: February 9, 2026