[Paper] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Published: February 23, 2026

Source: arXiv - 2602.20119v1

Overview

NovaPlan tackles one of robotics’ toughest challenges: getting a robot to carry out multi‑step, open‑ended manipulation tasks without any task‑specific training. By marrying large vision‑language models (VLMs) with video‑based planning and a geometry‑aware low‑level controller, the system can think, watch, and act in a closed loop, automatically recovering from mistakes on the fly.

Key Contributions

  • Zero‑shot hierarchical planning – A VLM‑driven high‑level planner breaks down arbitrary natural‑language instructions into sub‑goals and continuously monitors execution.
  • Closed‑loop video‑based imagination – The system generates short video clips of the desired sub‑goal, extracts both object keypoints and human hand poses, and uses them as motion priors for the robot.
  • Dynamic prior switching – A lightweight selector chooses between object‑centric and hand‑centric priors depending on visual conditions (e.g., occlusions, depth noise), keeping the robot’s motion stable.
  • Autonomous error recovery – If a low‑level action fails, the high‑level VLM replans the remaining steps, enabling robust long‑horizon behavior without human intervention.
  • Broad evaluation – Demonstrated on three complex assembly tasks and the Functional Manipulation Benchmark (FMB), outperforming prior zero‑shot baselines.
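The dynamic prior-switching idea above can be sketched as a small decision rule: trust the hand-pose prior by default, but fall back to object keypoints when the hand is occluded, and avoid keypoints when depth is too noisy. The function name, score definitions, and thresholds below are illustrative assumptions, not details from the paper.

```python
def select_prior(occlusion_score: float, depth_noise: float) -> str:
    """Pick the more trustworthy motion prior for the next sub-goal.

    occlusion_score: fraction of the hand that is occluded (0..1)
    depth_noise:     estimated depth-sensor error in meters
    (Both scores and thresholds are hypothetical placeholders.)
    """
    # Hand poses degrade quickly under occlusion; fall back to
    # object keypoints when the hand is mostly hidden.
    if occlusion_score > 0.5:
        return "object_keypoints"
    # Object keypoints rely on accurate depth; prefer the hand-centric
    # prior when the depth map is too noisy.
    if depth_noise > 0.02:
        return "hand_pose"
    # Otherwise default to the hand-centric prior, which carries a
    # full kinematic trajectory rather than isolated grasp points.
    return "hand_pose"


print(select_prior(occlusion_score=0.7, depth_noise=0.01))  # object_keypoints
```

In the paper, this role is played by a learned lightweight classifier; the hard thresholds here only illustrate the selection logic.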

Methodology

  1. High‑level semantic planner – A pre‑trained vision‑language model receives the user’s natural‑language command (e.g., “assemble the toy car”). It generates a sequence of textual sub‑goals (e.g., “pick up the wheel”, “attach wheel to axle”).
  2. Closed‑loop monitoring – After each sub‑goal, the robot streams its camera feed back to the VLM. If the observed state deviates from the imagined outcome, the planner revises the remaining plan.
  3. Video imagination & prior extraction – For each sub‑goal, a video generation model synthesizes a short clip of a human performing the step. From this clip the system extracts:
    • Object keypoints (e.g., corners of a block) that define where the robot should grasp or place items.
    • Human hand poses that provide a kinematic trajectory.
  4. Prior selection & low‑level control – A lightweight classifier evaluates visual reliability (occlusion, depth error) and picks the more trustworthy prior. The chosen prior is then converted into joint‑space commands using a geometry‑aware controller that respects collision constraints.
  5. Iterative execution – The robot executes the low‑level motion, streams back sensor data, and the loop repeats until the full task is completed.
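The five steps above can be outlined as a single control loop. The stub callables below (`plan`, `imagine`, `execute`, `achieved`, `replan`) stand in for the paper's components (VLM planner, video generator, geometry-aware controller) and are purely hypothetical; only the closed-loop control flow is the point.

```python
def run_task(instruction, plan, imagine, execute, achieved, replan,
             max_replans=3):
    """Execute sub-goals in a closed loop, replanning on failure."""
    subgoals = plan(instruction)              # 1. high-level semantic planner
    replans = 0
    log = []
    while subgoals:
        goal = subgoals[0]
        prior = imagine(goal)                 # 3. video imagination -> motion prior
        execute(prior)                        # 4. prior selection & low-level control
        if achieved(goal):                    # 2. closed-loop monitoring
            log.append(("done", goal))
            subgoals = subgoals[1:]           # 5. iterate until the task completes
        elif replans < max_replans:           # autonomous error recovery
            log.append(("replan", goal))
            subgoals = replan(instruction)
            replans += 1
        else:
            raise RuntimeError(f"gave up on sub-goal: {goal}")
    return log


# Toy usage: the first grasp "fails" once, triggering one replan.
state = {"failed_once": False}

def achieved(goal):
    if goal == "pick up wheel" and not state["failed_once"]:
        state["failed_once"] = True
        return False
    return True

log = run_task(
    "assemble the toy car",
    plan=lambda _: ["pick up wheel", "attach wheel to axle"],
    imagine=lambda g: f"prior for {g}",
    execute=lambda p: None,
    achieved=achieved,
    replan=lambda _: ["pick up wheel", "attach wheel to axle"],
)
# log records one replan followed by the two completed sub-goals
```

In the real system, `achieved` corresponds to the VLM comparing the camera feed against the imagined outcome, and `replan` regenerates only the remaining steps rather than restarting from scratch.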

Results & Findings

| Task / Benchmark | Success Rate (Zero-Shot) | Compared Baseline | Notable Behaviors |
|---|---|---|---|
| Toy Car Assembly (4 steps) | 87% | VLM-only planning (45%) | Re-planned after a missed grasp, completed assembly. |
| Shelf-Stacking (5 objects) | 81% | Video-only prior (58%) | Switched to object-keypoint prior when hand pose was occluded. |
| Functional Manipulation Benchmark (FMB) | 73% (average across 10 tasks) | Prior state-of-the-art zero-shot (62%) | Demonstrated dexterous error recovery, e.g., re-grasping a slipped object. |

Key takeaways

  • The closed‑loop VLM monitor dramatically reduces failure propagation; a single mis‑step rarely derails the whole task.
  • Prior switching improves robustness under challenging visual conditions, yielding smoother trajectories.
  • All capabilities emerge without any task‑specific demonstrations or fine‑tuning, confirming the zero‑shot claim.

Practical Implications

  • Rapid prototyping for new tasks – Engineers can hand a robot a plain English instruction and let NovaPlan generate a viable execution plan, cutting down on data collection and annotation costs.
  • Adaptive manufacturing cells – In flexible factories where product variants change frequently, NovaPlan can re‑configure manipulation sequences on the fly, handling unexpected part placements or minor jams.
  • Assistive robotics – Home‑assistant robots could interpret user commands (“set the table”) and recover gracefully if a plate slips, making them safer and more trustworthy.
  • Tool‑agnostic development – Because the system relies on generic video generation and VLMs, it can be integrated with existing robot stacks (ROS, MoveIt) without bespoke perception pipelines.

Limitations & Future Work

  • Reliance on video generation quality – Poorly imagined clips (e.g., unrealistic lighting) can corrupt keypoint extraction, limiting performance in highly cluttered scenes.
  • Depth sensor accuracy – The geometry controller still suffers when depth measurements are noisy, especially for reflective or transparent objects.
  • Scalability of VLM monitoring – Real‑time closed‑loop inference can become a bottleneck on edge hardware; future work may explore lightweight distillation.
  • Extending to non‑rigid manipulation – Current experiments focus on rigid objects; handling deformable items (cloth, food) will require richer priors and possibly tactile feedback.

The authors plan to explore tighter integration of tactile sensing, improve the robustness of video priors under domain shift, and benchmark NovaPlan on larger‑scale industrial assembly lines.

Authors

  • Jiahui Fu
  • Junyu Nan
  • Lingfeng Sun
  • Hongyu Li
  • Jianing Qian
  • Jennifer L. Barry
  • Kris Kitani
  • George Konidaris

Paper Information

  • arXiv ID: 2602.20119v1
  • Categories: cs.RO, cs.AI, cs.CV
  • Published: February 23, 2026