[Paper] LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Published: December 24, 2025 at 10:36 AM EST
4 min read

Source: arXiv - 2512.21243v1

Overview

The paper introduces LookPlanGraph, a new approach for embodied instruction‑following that keeps a robot’s internal scene graph up‑to‑date while it executes a task. By continuously fusing egocentric visual input with a Vision‑Language Model (VLM), the system can verify existing object priors and discover new ones on the fly, dramatically improving robustness when the environment changes between planning and execution.

Key Contributions

  • Dynamic scene‑graph augmentation: Combines a static graph of known assets with real‑time updates derived from the robot’s camera feed (a data‑structure sketch follows this list).
  • VLM‑driven perception loop: Uses a large Vision‑Language Model to interpret egocentric images and map them to graph nodes (objects, locations, relations).
  • GraSIF dataset: A curated benchmark of 514 instruction‑following tasks spanning SayPlan Office, BEHAVIOR‑1K, and VirtualHome RobotHow, together with an automated validation framework.
  • Empirical validation: Shows consistent performance gains over static‑graph baselines in both simulated (VirtualHome, OmniGibson) and real‑world robot experiments.
  • Open‑source release: Code, dataset, and a project page are publicly available, encouraging reproducibility and community extensions.
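
To make the scene‑graph idea concrete, here is a minimal sketch of how such a graph of object priors might be represented; the node and edge fields (label, room, position, verified) are illustrative assumptions, not the paper’s exact schema.

```python
# Minimal sketch of a scene graph holding object priors. Field names are
# illustrative assumptions, not the schema used in the paper.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    node_id: str
    label: str                                              # e.g. "mug"
    room: str                                               # e.g. "kitchen"
    position: Optional[Tuple[float, float, float]] = None   # last known pose, may be stale
    verified: bool = False                                  # True once the VLM confirms it


@dataclass
class SceneGraph:
    nodes: Dict[str, ObjectNode] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, subj: str, relation: str, obj: str) -> None:
        self.edges.append((subj, relation, obj))


# Example prior: a mug is expected on the kitchen counter.
graph = SceneGraph()
graph.add_node(ObjectNode("mug_1", "mug", "kitchen", position=(1.2, 0.4, 0.9)))
graph.add_node(ObjectNode("counter_1", "counter", "kitchen"))
graph.add_edge("mug_1", "on", "counter_1")
```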

Methodology

  1. Initial Graph Construction – Before a task begins, a static scene graph is built from known assets (room layout, furniture, typical object locations). This graph contains priors about where objects are likely to be.
  2. LLM Planner – A Large Language Model receives the natural‑language instruction and the current graph, then generates a high‑level plan (e.g., “pick up the mug from the kitchen counter”).
  3. Egocentric Perception Loop – As the robot follows the plan, its forward‑facing camera streams images to a Vision‑Language Model (e.g., CLIP‑based or Flamingo‑style). The VLM extracts object labels, spatial cues, and relational statements (“a red mug is on the table”).
  4. Graph Augmentation – The extracted information is matched against existing priors (a code sketch of this step follows the list):
    • Verification – Confirms that an expected object is still where the graph says it is.
    • Discovery – Inserts new nodes or updates positions when the VLM spots an object that was missing or moved.
  5. Re‑planning (optional) – If the graph changes significantly (e.g., a required object is not found), the LLM can be prompted again with the updated graph to adjust the plan.
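
Reusing the SceneGraph/ObjectNode sketch above, the verification/discovery step might look roughly like this; the label‑plus‑distance matching heuristic and the parsed detection format are assumptions for illustration, not the paper’s exact procedure.

```python
# Hedged sketch of the graph-augmentation step, reusing the SceneGraph /
# ObjectNode classes from the earlier sketch. The matching heuristic and
# detection format are assumptions, not the paper's exact procedure.
import math


def augment_graph(graph: SceneGraph, detections, match_radius: float = 0.5) -> bool:
    """Verify priors against VLM detections; return True if the graph changed.

    detections: list of dicts such as {"label": "mug", "position": (x, y, z)}.
    """
    changed = False
    for det in detections:
        candidates = [n for n in graph.nodes.values() if n.label == det["label"]]
        confirmed = [
            n for n in candidates
            if n.position is not None
            and math.dist(n.position, det["position"]) <= match_radius
        ]
        if confirmed:
            # Verification: the object is where the graph already expected it.
            confirmed[0].verified = True
        elif candidates:
            # Discovery (moved object): relocate the nearest existing prior.
            moved = min(
                candidates,
                key=lambda n: math.dist(n.position, det["position"])
                if n.position is not None else float("inf"),
            )
            moved.position = det["position"]
            moved.verified = True
            changed = True
        else:
            # Discovery (new object): insert a fresh node for an unknown item.
            new_id = f"{det['label']}_{len(graph.nodes)}"
            graph.add_node(ObjectNode(new_id, det["label"], room="unknown",
                                      position=det["position"], verified=True))
            changed = True
    return changed
```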

The whole pipeline runs in a tight perception‑planning loop, allowing the robot to react to dynamic environments without rebuilding the entire graph from scratch.
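
Putting the steps together, the loop might be organized like the sketch below; the robot, vlm, and llm interfaces and the step_still_feasible check are hypothetical placeholders standing in for the perception, planning, and control components, not the paper’s actual API.

```python
# Hedged sketch of the perception-planning loop. The robot / vlm / llm
# interfaces (capture_frame, detect, plan, execute) and step_still_feasible
# are hypothetical placeholders, not the paper's actual API.
def run_task(instruction, graph, robot, vlm, llm, max_steps=50):
    plan = llm.plan(instruction, graph)              # step 2: high-level plan
    for _ in range(max_steps):
        if not plan:
            return True                              # every plan step executed
        frame = robot.capture_frame()                # step 3: egocentric image
        detections = vlm.detect(frame)               # parsed labels + positions
        changed = augment_graph(graph, detections)   # step 4: verify / discover
        if changed and not step_still_feasible(plan[0], graph):
            plan = llm.plan(instruction, graph)      # step 5: optional re-planning
            continue
        robot.execute(plan.pop(0))                   # act on the next plan step
    return False                                     # step budget exhausted
```

Because each iteration only touches the nodes affected by new detections, the graph is refined in place rather than rebuilt, which is what keeps per‑frame updates cheap enough for real‑time control.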

Results & Findings

| Environment | Baseline success rate (static graph) | LookPlanGraph success rate | Gain (pp) |
| --- | --- | --- | --- |
| VirtualHome (object relocation) | 62 % | 78 % | +16 |
| OmniGibson (randomized furniture) | 55 % | 71 % | +16 |
| Real‑world tabletop task | 48 % | 66 % | +18 |

  • Higher task completion: The dynamic updates reduced failure modes caused by stale object locations.
  • Robustness to unseen changes: Even when objects were moved to entirely new rooms, the VLM could detect them and the planner adapted accordingly.
  • Efficiency: Graph updates required only a few milliseconds per frame, keeping the overall latency suitable for real‑time control.

The GraSIF benchmark also demonstrated that the method scales across diverse instruction styles and scene complexities.

Practical Implications

  • Home and office service robots can now handle everyday disturbances (e.g., a coffee mug moved to a different desk) without human intervention.
  • Warehouse automation benefits from on‑the‑fly verification of item locations, reducing the need for costly periodic re‑scanning of the entire floor.
  • Human‑robot collaboration becomes smoother: the robot can ask clarifying questions or re‑plan when it cannot locate a requested object, mirroring natural teamwork.
  • Developer workflow: By exposing the graph‑augmentation module as a plug‑and‑play component, engineers can integrate it into existing LLM‑based planners with minimal code changes (a rough integration sketch follows this list).
  • Data efficiency: Since only the egocentric view is processed, the system avoids the overhead of building full 3‑D reconstructions, making it viable on edge devices with limited compute.
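
As a rough illustration of the plug‑and‑play point above, a thin wrapper could refresh the graph before delegating to an unchanged LLM planner; the AugmentedPlanner class and the base_planner / augmenter interfaces are hypothetical names meant only to show where the augmentation hook would slot in.

```python
# Hedged sketch of dropping the augmentation module into an existing
# LLM-based planner. AugmentedPlanner, base_planner, and augmenter are
# hypothetical names used only to illustrate the integration point.
class AugmentedPlanner:
    def __init__(self, base_planner, augmenter, vlm, camera):
        self.base_planner = base_planner   # existing LLM planner, left unchanged
        self.augmenter = augmenter         # graph-augmentation module
        self.vlm = vlm
        self.camera = camera

    def next_action(self, instruction, graph):
        # Refresh the graph from the latest egocentric frame before planning.
        detections = self.vlm.detect(self.camera.capture_frame())
        self.augmenter.update(graph, detections)
        # Delegate to the unchanged base planner, now with fresher priors.
        return self.base_planner.next_action(instruction, graph)
```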

Limitations & Future Work

  • Reliance on VLM accuracy: Misclassifications in the visual stream can propagate erroneous graph updates, especially for small or occluded objects.
  • Static priors still needed: The initial graph must contain a reasonable set of asset priors; completely unknown environments may require a separate discovery phase.
  • Scalability to large, cluttered spaces: While the current implementation handles typical indoor rooms, scaling to multi‑room facilities may need hierarchical graph structures.
  • Future directions proposed by the authors include integrating depth sensors for richer spatial reasoning, learning to prioritize which priors to verify (to save compute), and extending the approach to multi‑agent scenarios where several robots share and update a common graph.

Authors

  • Anatoly O. Onishchenko
  • Alexey K. Kovalev
  • Aleksandr I. Panov

Paper Information

  • arXiv ID: 2512.21243v1
  • Categories: cs.RO, cs.AI, cs.LG
  • Published: December 24, 2025
  • PDF: Download PDF