[Paper] LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation
Source: arXiv - 2512.21243v1
Overview
The paper introduces LookPlanGraph, a new approach for embodied instruction‑following that keeps a robot’s internal scene graph up‑to‑date while it executes a task. By continuously fusing egocentric visual input with a Vision‑Language Model (VLM), the system can verify existing object priors and discover new ones on the fly, dramatically improving robustness when the environment changes between planning and execution.
Key Contributions
- Dynamic scene‑graph augmentation: Combines a static graph of known assets with real‑time updates derived from the robot’s camera feed.
- VLM‑driven perception loop: Uses a large Vision‑Language Model to interpret egocentric images and map them to graph nodes (objects, locations, relations).
- GraSIF dataset: A curated benchmark of 514 instruction‑following tasks spanning SayPlan Office, BEHAVIOR‑1K, and VirtualHome RobotHow, together with an automated validation framework.
- Empirical validation: Shows consistent performance gains over static‑graph baselines in both simulated (VirtualHome, OmniGibson) and real‑world robot experiments.
- Open‑source release: Code, dataset, and a project page are publicly available, encouraging reproducibility and community extensions.
Methodology
- Initial Graph Construction – Before a task begins, a static scene graph is built from known assets (room layout, furniture, typical object locations). This graph contains priors about where objects are likely to be.
- LLM Planner – A Large Language Model receives the natural‑language instruction and the current graph, then generates a high‑level plan (e.g., “pick up the mug from the kitchen counter”).
- Egocentric Perception Loop – As the robot follows the plan, its forward‑facing camera streams images to a Vision‑Language Model (e.g., CLIP‑based or Flamingo‑style). The VLM extracts object labels, spatial cues, and relational statements (“a red mug is on the table”).
- Graph Augmentation – The extracted information is matched against existing priors (a minimal sketch follows this list):
  - Verification – Confirms that an expected object is still where the graph says it is.
  - Discovery – Inserts new nodes or updates positions when the VLM spots an object that is missing or has moved.
- Re‑planning (optional) – If the graph changes significantly (e.g., a required object is not found), the LLM can be prompted again with the updated graph to adjust the plan.
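The verification/discovery step can be pictured as a small update routine over a node store. The sketch below is only an illustration of that idea under simplifying assumptions, not the authors' implementation; the `Detection` and `SceneGraph` classes, the flat label-to-location representation, and the `min_conf` threshold are all hypothetical names chosen for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    """One object hypothesis extracted by the VLM from an egocentric frame."""
    label: str          # e.g. "red mug"
    location: str       # e.g. "kitchen_counter"
    confidence: float   # detection/grounding score in [0, 1]


@dataclass
class SceneGraph:
    """Flat node store: object label -> believed location (prior or observed)."""
    nodes: dict = field(default_factory=dict)

    def augment(self, detections: list[Detection], min_conf: float = 0.5) -> bool:
        """Verify existing priors and add or relocate nodes from VLM evidence.

        Returns True if the graph changed, signalling that re-planning
        may be worthwhile.
        """
        changed = False
        for det in detections:
            if det.confidence < min_conf:
                continue  # discard low-confidence VLM output
            if self.nodes.get(det.label) == det.location:
                continue  # verification: the prior is confirmed, nothing to update
            # discovery: a new object, or a known object observed elsewhere
            self.nodes[det.label] = det.location
            changed = True
        return changed
```

Returning a change flag is one simple way to trigger the optional re-planning step only when the observations actually diverge from the planner's assumptions.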
The whole pipeline runs in a tight perception‑planning loop, allowing the robot to react to dynamic environments without rebuilding the entire graph from scratch.
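To make that loop concrete, the following sketch shows one way planning, egocentric perception, and graph augmentation might interleave, reusing the hypothetical `SceneGraph.augment` from the previous sketch. The `llm_plan`, `vlm_describe`, `camera`, and `execute_step` callables stand in for whatever LLM, VLM, and robot backends are used; their interfaces are assumptions, not the paper's API.

```python
def run_task(instruction, graph, llm_plan, vlm_describe, camera, execute_step,
             max_replans=3):
    """Illustrative perception-planning loop (not the authors' code).

    Assumed callables:
      llm_plan(instruction, graph)  -> list of high-level steps
      vlm_describe(image)           -> list[Detection] from one egocentric frame
      camera()                      -> current egocentric image
      execute_step(step)            -> True on success, False on failure
    """
    plan = list(llm_plan(instruction, graph))
    replans = 0
    while plan:
        # Perceive before acting: fold the latest frame into the scene graph.
        detections = vlm_describe(camera())
        if graph.augment(detections) and replans < max_replans:
            # The world diverged from the priors: re-plan against the updated
            # graph instead of rebuilding it from scratch.
            plan = list(llm_plan(instruction, graph))
            replans += 1
            continue
        if not execute_step(plan.pop(0)):
            return False  # unrecoverable execution failure
    return True
```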
Results & Findings
| Environment | Success rate (static‑graph baseline) | Success rate (LookPlanGraph) | Gain (percentage points) |
|---|---|---|---|
| VirtualHome (object relocation) | 62 % | 78 % | +16 |
| OmniGibson (randomized furniture) | 55 % | 71 % | +16 |
| Real‑world tabletop task | 48 % | 66 % | +18 |
- Higher task completion: The dynamic updates reduced failure modes caused by stale object locations.
- Robustness to unseen changes: Even when objects were moved to entirely new rooms, the VLM could detect them and the planner adapted accordingly.
- Efficiency: Graph updates required only a few milliseconds per frame, keeping the overall latency suitable for real‑time control.
The GraSIF benchmark also demonstrated that the method scales across diverse instruction styles and scene complexities.
Practical Implications
- Home and office service robots can now handle everyday disturbances (e.g., a coffee mug moved to a different desk) without human intervention.
- Warehouse automation benefits from on‑the‑fly verification of item locations, reducing the need for costly periodic re‑scanning of the entire floor.
- Human‑robot collaboration becomes smoother: the robot can ask clarifying questions or re‑plan when it cannot locate a requested object, mirroring natural teamwork.
- Developer workflow: By exposing the graph‑augmentation module as a plug‑and‑play component, engineers can integrate it into existing LLM‑based planners with minimal code changes (see the interface sketch after this list).
- Data efficiency: Since only the egocentric view is processed, the system avoids the overhead of building full 3‑D reconstructions, making it viable on edge devices with limited compute.
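As a rough illustration of the plug‑and‑play point above, an augmentation module could sit behind a small interface that an existing planner calls once per perception cycle. The `GraphAugmenter` protocol and its method names below are hypothetical, chosen only to show the shape of such an integration, and are not part of the released code.

```python
from typing import Protocol


class GraphAugmenter(Protocol):
    """Minimal interface an existing planner would need from the module."""

    def update(self, image) -> bool:
        """Fold one egocentric frame into the scene graph; return True if the
        graph changed enough that re-planning is worth considering."""
        ...

    def snapshot(self) -> str:
        """Serialize the current graph for inclusion in the next LLM prompt."""
        ...
```

Under this assumption, an existing LLM-based planner would only need two extra calls per cycle: `update(frame)` after each new observation and `snapshot()` when building the next planning prompt.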
Limitations & Future Work
- Reliance on VLM accuracy: Misclassifications in the visual stream can propagate erroneous graph updates, especially for small or occluded objects.
- Static priors still needed: The initial graph must contain a reasonable set of asset priors; completely unknown environments may require a separate discovery phase.
- Scalability to large, cluttered spaces: While the current implementation handles typical indoor rooms, scaling to multi‑room facilities may need hierarchical graph structures.
- Future directions proposed by the authors include integrating depth sensors for richer spatial reasoning, learning to prioritize which priors to verify (to save compute), and extending the approach to multi‑agent scenarios where several robots share and update a common graph.
Authors
- Anatoly O. Onishchenko
- Alexey K. Kovalev
- Aleksandr I. Panov
Paper Information
- arXiv ID: 2512.21243v1
- Categories: cs.RO, cs.AI, cs.LG
- Published: December 24, 2025