A better method for planning complex visual tasks

Published: March 11, 2026

Source: MIT News - AI

MIT Researchers Introduce a Generative‑AI‑Driven Approach for Long‑Term Visual Planning

MIT researchers have developed a generative artificial‑intelligence‑driven approach for planning long‑term visual tasks—such as robot navigation—that is about twice as effective as several existing techniques.

The method uses a specialized vision‑language model (VLM) to perceive a scenario in an image and simulate the actions needed to reach a goal. A second model then translates those simulations into a standard programming language for planning problems and refines the solution.

In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two‑step system produced plans with an average success rate of ≈ 70 %, outperforming the best baseline methods that reached only ≈ 30 %.

Importantly, the system can solve new problems it hasn’t encountered before, making it well‑suited for real environments where conditions can change at a moment’s notice.

“Our framework combines the advantages of vision‑language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open‑access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long‑horizon plan that could be useful in many real‑life applications.”

She is joined on the paper by Yongchao Chen (graduate student, MIT Laboratory for Information and Decision Systems – LIDS), Chuchu Fan (associate professor, AeroAstro and principal investigator, LIDS), and Yang Zhang (research scientist, MIT‑IBM Watson AI Lab). The paper will be presented at the International Conference on Learning Representations (ICLR).


Tackling Visual Tasks

For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.

  • Many real‑world planning problems—robotic assembly, autonomous driving, etc.—have visual inputs that an LLM can’t handle well on its own.
  • The researchers therefore turned to vision‑language models (VLMs), powerful AI systems that can process both images and text.

However, VLMs struggle with:

  • Understanding spatial relationships between objects in a scene.
  • Reasoning correctly over many steps, which is essential for long‑range planning.

Conversely, formal planners can generate effective long‑horizon plans for complex situations, but they cannot process visual inputs and require expert knowledge to encode a problem into a language the solver can understand.

The VLMFP System

Fan’s team built an automatic planning system that combines the best of both worlds. The system, called VLM‑guided Formal Planning (VLMFP), uses two specialized VLMs that work together to turn visual planning problems into ready‑to‑use files for formal planning software.

  1. SimVLM – a small model trained to describe the scenario in an image using natural language and to simulate a sequence of actions in that scenario.
  2. GenVLM – a much larger model that takes SimVLM’s description and generates a set of initial files in the Planning Domain Definition Language (PDDL).

The generated files are fed into a classical PDDL solver, which computes a step‑by‑step plan to solve the task. GenVLM then compares the solver’s results with those of the simulator and iteratively refines the PDDL files.

“The generator and simulator work together to reach the exact same result—an action simulation that achieves the goal,” Hao says.

Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files.
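The generate–solve–verify–refine loop described above can be sketched as follows. This is an illustrative outline only: every function name is a placeholder (not the authors' API), and the stub bodies exist just to show the data flow between SimVLM, GenVLM, and the solver.

```python
# Hypothetical sketch of VLMFP's generate-solve-refine loop.
# All function names and stub bodies are placeholders, not the paper's code.

def sim_vlm_describe(image):
    # SimVLM: turn an image into a natural-language scene description.
    return f"scene described from {image}"

def gen_vlm_pddl(description, goal):
    # GenVLM: draft initial PDDL domain and problem files from text.
    return "(define (domain toy) ...)", f"(define (problem p) (:goal {goal}) ...)"

def pddl_solver(domain, problem):
    # Classical planner: returns a step-by-step plan (stubbed here).
    return ["move", "pick", "place"]

def sim_vlm_verify(image, plan, goal):
    # SimVLM simulates the plan in the scene and checks goal completion.
    return bool(plan)

def gen_vlm_refine(domain, problem, plan):
    # GenVLM revises the PDDL files when solver and simulator disagree.
    return domain, problem

def plan_from_image(image, goal, max_iters=5):
    description = sim_vlm_describe(image)
    domain, problem = gen_vlm_pddl(description, goal)
    for _ in range(max_iters):
        plan = pddl_solver(domain, problem)
        if plan and sim_vlm_verify(image, plan, goal):
            return plan  # solver and simulator agree the goal is reached
        domain, problem = gen_vlm_refine(domain, problem, plan)
    return None
```

The key design point is the stopping condition: the loop ends only when the formal solver's plan also passes SimVLM's simulation check, so the two models must converge on the same result.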


A Flexible Approach

VLMFP produces two separate PDDL files:

  • Domain file – defines the environment, valid actions, and domain rules.
  • Problem file – defines the initial state and the goal for a particular instance.

“One advantage of PDDL is that the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.
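To make the domain/problem split concrete, here is a minimal illustrative PDDL pair, written as Python string literals. The grid‑navigation domain and both problem instances are invented for illustration and are not taken from the paper; they simply show one domain file being reused by instances that differ only in objects, initial state, and goal.

```python
# Illustrative PDDL (not from the paper): one domain file shared by
# two problem instances that differ in objects, initial state, and goal.

DOMAIN = """\
(define (domain grid-nav)
  (:predicates (at ?loc) (adjacent ?a ?b))
  (:action move
    :parameters (?from ?to)
    :precondition (and (at ?from) (adjacent ?from ?to))
    :effect (and (at ?to) (not (at ?from)))))
"""

# Instance 1: start at a, reach c.
PROBLEM_A = """\
(define (problem reach-c)
  (:domain grid-nav)
  (:objects a b c)
  (:init (at a) (adjacent a b) (adjacent b c))
  (:goal (at c)))
"""

# Instance 2: same domain, different start and goal.
PROBLEM_B = """\
(define (problem reach-b)
  (:domain grid-nav)
  (:objects a b)
  (:init (at a) (adjacent a b))
  (:goal (at b)))
"""
```

Because both problem files point at the same `(:domain grid-nav)`, a system that gets the domain file right once can, in principle, handle any new instance in that environment by generating only a fresh problem file.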

Training & Generalization

To enable effective generalization, the researchers carefully designed a modest amount of training data for SimVLM so the model learned to understand the problem and goal without memorizing specific patterns. In tests, SimVLM successfully:

  • Described the scenario.
  • Simulated actions.
  • Detected whether the goal was reached.

…in about 85 % of experiments.

Performance

  • Six 2‑D planning tasks: ≈ 60 % success
  • Two 3‑D tasks (multirobot collaboration, robotic assembly): > 80 % success
  • Unseen scenarios (valid plans generated): > 50 %

These results far outpace baseline methods, which struggled to exceed 30 % success on comparable benchmarks.

“Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual‑based planning problems,” Fan adds.


Future Directions

The team plans to:

  • Extend VLMFP to handle more complex scenarios.
  • Develop methods to identify and mitigate hallucinations by the VLMs.

“In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual‑based planning into the picture, this work is an important piece of the puzzle,” Fan says.

This work was funded, in part, by the MIT‑IBM Watson AI Lab.
