[Paper] LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Published: December 24, 2025 at 10:36 AM EST
4 min read

Source: arXiv - 2512.21243v1

Overview

The paper introduces LookPlanGraph, a new approach for embodied instruction‑following that keeps a robot’s internal scene graph up‑to‑date while it executes a task. By continuously fusing egocentric visual input with a Vision‑Language Model (VLM), the system can verify existing object priors and discover new ones on the fly, dramatically improving robustness when the environment changes between planning and execution.

Key Contributions

  • Dynamic scene‑graph augmentation: Combines a static graph of known assets with real‑time updates derived from the robot’s camera feed (a data‑structure sketch follows this list).
  • VLM‑driven perception loop: Uses a large Vision‑Language Model to interpret egocentric images and map them to graph nodes (objects, locations, relations).
  • GraSIF dataset: A curated benchmark of 514 instruction‑following tasks spanning SayPlan Office, BEHAVIOR‑1K, and VirtualHome RobotHow, together with an automated validation framework.
  • Empirical validation: Shows consistent performance gains over static‑graph baselines in both simulated (VirtualHome, OmniGibson) and real‑world robot experiments.
  • Open‑source release: Code, dataset, and a project page are publicly available, encouraging reproducibility and community extensions.
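
To make the scene‑graph idea concrete, here is a minimal sketch of how such a graph of object priors might be represented; the node and edge fields (label, room, position, verified) are illustrative assumptions, not the paper’s exact schema.

```python
# Minimal sketch of a scene graph holding object priors. Field names are
# illustrative assumptions, not the schema used in the paper.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    node_id: str
    label: str                                              # e.g. "mug"
    room: str                                               # e.g. "kitchen"
    position: Optional[Tuple[float, float, float]] = None   # last known pose, may be stale
    verified: bool = False                                  # True once the VLM confirms it


@dataclass
class SceneGraph:
    nodes: Dict[str, ObjectNode] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, subj: str, relation: str, obj: str) -> None:
        self.edges.append((subj, relation, obj))


# Example prior: a mug is expected on the kitchen counter.
graph = SceneGraph()
graph.add_node(ObjectNode("mug_1", "mug", "kitchen", position=(1.2, 0.4, 0.9)))
graph.add_node(ObjectNode("counter_1", "counter", "kitchen"))
graph.add_edge("mug_1", "on", "counter_1")
```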

Methodology

  1. Initial Graph Construction – Before a task begins, a static scene graph is built from known assets (room layout, furniture, typical object locations). This graph contains priors about where objects are likely to be.
  2. LLM Planner – A Large Language Model receives the natural‑language instruction and the current graph, then generates a high‑level plan (e.g., “pick up the mug from the kitchen counter”).
  3. Egocentric Perception Loop – As the robot follows the plan, its forward‑facing camera streams images to a Vision‑Language Model (e.g., CLIP‑based or Flamingo‑style). The VLM extracts object labels, spatial cues, and relational statements (“a red mug is on the table”).
  4. Graph Augmentation – The extracted information is matched against existing priors (a code sketch of this step follows the list):
    • Verification – Confirms that an expected object is still where the graph says it is.
    • Discovery – Inserts new nodes or updates positions when the VLM spots an object that was missing or moved.
  5. Re‑planning (optional) – If the graph changes significantly (e.g., a required object is not found), the LLM can be prompted again with the updated graph to adjust the plan.
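
Reusing the SceneGraph/ObjectNode sketch above, the verification/discovery step might look roughly like this; the label‑plus‑distance matching heuristic and the parsed detection format are assumptions for illustration, not the paper’s exact procedure.

```python
# Hedged sketch of the graph-augmentation step, reusing the SceneGraph /
# ObjectNode classes from the earlier sketch. The matching heuristic and
# detection format are assumptions, not the paper's exact procedure.
import math


def augment_graph(graph: SceneGraph, detections, match_radius: float = 0.5) -> bool:
    """Verify priors against VLM detections; return True if the graph changed.

    detections: list of dicts such as {"label": "mug", "position": (x, y, z)}.
    """
    changed = False
    for det in detections:
        candidates = [n for n in graph.nodes.values() if n.label == det["label"]]
        confirmed = [
            n for n in candidates
            if n.position is not None
            and math.dist(n.position, det["position"]) <= match_radius
        ]
        if confirmed:
            # Verification: the object is where the graph already expected it.
            confirmed[0].verified = True
        elif candidates:
            # Discovery (moved object): relocate the nearest existing prior.
            moved = min(
                candidates,
                key=lambda n: math.dist(n.position, det["position"])
                if n.position is not None else float("inf"),
            )
            moved.position = det["position"]
            moved.verified = True
            changed = True
        else:
            # Discovery (new object): insert a fresh node for an unknown item.
            new_id = f"{det['label']}_{len(graph.nodes)}"
            graph.add_node(ObjectNode(new_id, det["label"], room="unknown",
                                      position=det["position"], verified=True))
            changed = True
    return changed
```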

The whole pipeline runs in a tight perception‑planning loop, allowing the robot to react to dynamic environments without rebuilding the entire graph from scratch.
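
Putting the steps together, the loop might be organized like the sketch below; the robot, vlm, and llm interfaces and the step_still_feasible check are hypothetical placeholders standing in for the perception, planning, and control components, not the paper’s actual API.

```python
# Hedged sketch of the perception-planning loop. The robot / vlm / llm
# interfaces (capture_frame, detect, plan, execute) and step_still_feasible
# are hypothetical placeholders, not the paper's actual API.
def run_task(instruction, graph, robot, vlm, llm, max_steps=50):
    plan = llm.plan(instruction, graph)              # step 2: high-level plan
    for _ in range(max_steps):
        if not plan:
            return True                              # every plan step executed
        frame = robot.capture_frame()                # step 3: egocentric image
        detections = vlm.detect(frame)               # parsed labels + positions
        changed = augment_graph(graph, detections)   # step 4: verify / discover
        if changed and not step_still_feasible(plan[0], graph):
            plan = llm.plan(instruction, graph)      # step 5: optional re-planning
            continue
        robot.execute(plan.pop(0))                   # act on the next plan step
    return False                                     # step budget exhausted
```

Because each iteration only touches the nodes affected by new detections, the graph is refined in place rather than rebuilt, which is what keeps per‑frame updates cheap enough for real‑time control.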

Results & Findings

| Environment | Baseline success rate (static graph) | LookPlanGraph success rate | Gain (pp) |
| --- | --- | --- | --- |
| VirtualHome (object relocation) | 62 % | 78 % | +16 |
| OmniGibson (randomized furniture) | 55 % | 71 % | +16 |
| Real‑world tabletop task | 48 % | 66 % | +18 |

  • Higher task completion: The dynamic updates reduced failure modes caused by stale object locations.
  • Robustness to unseen changes: Even when objects were moved to entirely new rooms, the VLM could detect them and the planner adapted accordingly.
  • Efficiency: Graph updates required only a few milliseconds per frame, keeping the overall latency suitable for real‑time control.

The GraSIF benchmark also demonstrated that the method scales across diverse instruction styles and scene complexities.

Practical Implications

  • Home and office service robots can now handle everyday disturbances (e.g., a coffee mug moved to a different desk) without human intervention.
  • Warehouse automation benefits from on‑the‑fly verification of item locations, reducing the need for costly periodic re‑scanning of the entire floor.
  • Human‑robot collaboration becomes smoother: the robot can ask clarifying questions or re‑plan when it cannot locate a requested object, mirroring natural teamwork.
  • Developer workflow: By exposing the graph‑augmentation module as a plug‑and‑play component, engineers can integrate it into existing LLM‑based planners with minimal code changes (a rough integration sketch follows this list).
  • Data efficiency: Since only the egocentric view is processed, the system avoids the overhead of building full 3‑D reconstructions, making it viable on edge devices with limited compute.
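
As a rough illustration of the plug‑and‑play point above, a thin wrapper could refresh the graph before delegating to an unchanged LLM planner; the AugmentedPlanner class and the base_planner / augmenter interfaces are hypothetical names meant only to show where the augmentation hook would slot in.

```python
# Hedged sketch of dropping the augmentation module into an existing
# LLM-based planner. AugmentedPlanner, base_planner, and augmenter are
# hypothetical names used only to illustrate the integration point.
class AugmentedPlanner:
    def __init__(self, base_planner, augmenter, vlm, camera):
        self.base_planner = base_planner   # existing LLM planner, left unchanged
        self.augmenter = augmenter         # graph-augmentation module
        self.vlm = vlm
        self.camera = camera

    def next_action(self, instruction, graph):
        # Refresh the graph from the latest egocentric frame before planning.
        detections = self.vlm.detect(self.camera.capture_frame())
        self.augmenter.update(graph, detections)
        # Delegate to the unchanged base planner, now with fresher priors.
        return self.base_planner.next_action(instruction, graph)
```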

Limitations & Future Work

  • Reliance on VLM accuracy: Misclassifications in the visual stream can propagate erroneous graph updates, especially for small or occluded objects.
  • Static priors still needed: The initial graph must contain a reasonable set of asset priors; completely unknown environments may require a separate discovery phase.
  • Scalability to large, cluttered spaces: While the current implementation handles typical indoor rooms, scaling to multi‑room facilities may need hierarchical graph structures.
  • Future directions proposed by the authors include integrating depth sensors for richer spatial reasoning, learning to prioritize which priors to verify (to save compute), and extending the approach to multi‑agent scenarios where several robots share and update a common graph.

Authors

  • Anatoly O. Onishchenko
  • Alexey K. Kovalev
  • Aleksandr I. Panov

Paper Information

  • arXiv ID: 2512.21243v1
  • Categories: cs.RO, cs.AI, cs.LG
  • Published: December 24, 2025
  • PDF: Download PDF