A better method for planning complex visual tasks

Published: March 11, 2026

Source: MIT News - AI

MIT Researchers Introduce a Generative‑AI‑Driven Approach for Long‑Term Visual Planning

MIT researchers have developed a generative artificial‑intelligence‑driven approach for planning long‑term visual tasks—such as robot navigation—that is about twice as effective as several existing techniques.

The method uses a specialized vision‑language model (VLM) to perceive a scenario in an image and simulate the actions needed to reach a goal. A second model then translates those simulations into a standard programming language for planning problems and refines the solution.

In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two‑step system produced plans with an average success rate of ≈ 70 %, outperforming the best baseline methods that reached only ≈ 30 %.

Importantly, the system can solve new problems it hasn’t encountered before, making it well‑suited for real environments where conditions can change at a moment’s notice.

“Our framework combines the advantages of vision‑language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open‑access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long‑horizon plan that could be useful in many real‑life applications.”

She is joined on the paper by Yongchao Chen (graduate student, MIT Laboratory for Information and Decision Systems – LIDS), Chuchu Fan (associate professor, AeroAstro and principal investigator, LIDS), and Yang Zhang (research scientist, MIT‑IBM Watson AI Lab). The paper will be presented at the International Conference on Learning Representations (ICLR).


Tackling Visual Tasks

For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.

  • Many real‑world planning problems—robotic assembly, autonomous driving, etc.—have visual inputs that an LLM can’t handle well on its own.
  • The researchers therefore turned to vision‑language models (VLMs), powerful AI systems that can process both images and text.

However, VLMs struggle with:

  • Understanding spatial relationships between objects in a scene.
  • Reasoning correctly over many steps, which is essential for long‑range planning.

Conversely, formal planners can generate effective long‑horizon plans for complex situations, but they cannot process visual inputs and require expert knowledge to encode a problem into a language the solver can understand.

The VLMFP System

Fan’s team built an automatic planning system that combines the best of both worlds. The system, called VLM‑guided Formal Planning (VLMFP), uses two specialized VLMs that work together to turn visual planning problems into ready‑to‑use files for formal planning software.

  1. SimVLM – a small model trained to describe the scenario in an image using natural language and to simulate a sequence of actions in that scenario.
  2. GenVLM – a much larger model that takes SimVLM’s description and generates a set of initial files in the Planning Domain Definition Language (PDDL).

The generated files are fed into a classical PDDL solver, which computes a step‑by‑step plan to solve the task. GenVLM then compares the solver’s results with those of the simulator and iteratively refines the PDDL files.

“The generator and simulator work together to reach the exact same result—an action simulation that achieves the goal,” Hao says.

Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files.
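The generate–solve–verify–refine loop described above can be sketched as follows. This is an illustrative outline only: every function name is a placeholder (not the authors' API), and the stub bodies exist just to show the data flow between SimVLM, GenVLM, and the solver.

```python
# Hypothetical sketch of VLMFP's generate-solve-refine loop.
# All function names and stub bodies are placeholders, not the paper's code.

def sim_vlm_describe(image):
    # SimVLM: turn an image into a natural-language scene description.
    return f"scene described from {image}"

def gen_vlm_pddl(description, goal):
    # GenVLM: draft initial PDDL domain and problem files from text.
    return "(define (domain toy) ...)", f"(define (problem p) (:goal {goal}) ...)"

def pddl_solver(domain, problem):
    # Classical planner: returns a step-by-step plan (stubbed here).
    return ["move", "pick", "place"]

def sim_vlm_verify(image, plan, goal):
    # SimVLM simulates the plan in the scene and checks goal completion.
    return bool(plan)

def gen_vlm_refine(domain, problem, plan):
    # GenVLM revises the PDDL files when solver and simulator disagree.
    return domain, problem

def plan_from_image(image, goal, max_iters=5):
    description = sim_vlm_describe(image)
    domain, problem = gen_vlm_pddl(description, goal)
    for _ in range(max_iters):
        plan = pddl_solver(domain, problem)
        if plan and sim_vlm_verify(image, plan, goal):
            return plan  # solver and simulator agree the goal is reached
        domain, problem = gen_vlm_refine(domain, problem, plan)
    return None
```

The key design point is the stopping condition: the loop ends only when the formal solver's plan also passes SimVLM's simulation check, so the two models must converge on the same result.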


A Flexible Approach

VLMFP produces two separate PDDL files:

  • Domain file – defines the environment, valid actions, and domain rules.
  • Problem file – defines the initial state and the goal for a particular instance.

“One advantage of PDDL is that the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.
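To make the domain/problem split concrete, here is a minimal illustrative PDDL pair, written as Python string literals. The grid‑navigation domain and both problem instances are invented for illustration and are not taken from the paper; they simply show one domain file being reused by instances that differ only in objects, initial state, and goal.

```python
# Illustrative PDDL (not from the paper): one domain file shared by
# two problem instances that differ in objects, initial state, and goal.

DOMAIN = """\
(define (domain grid-nav)
  (:predicates (at ?loc) (adjacent ?a ?b))
  (:action move
    :parameters (?from ?to)
    :precondition (and (at ?from) (adjacent ?from ?to))
    :effect (and (at ?to) (not (at ?from)))))
"""

# Instance 1: start at a, reach c.
PROBLEM_A = """\
(define (problem reach-c)
  (:domain grid-nav)
  (:objects a b c)
  (:init (at a) (adjacent a b) (adjacent b c))
  (:goal (at c)))
"""

# Instance 2: same domain, different start and goal.
PROBLEM_B = """\
(define (problem reach-b)
  (:domain grid-nav)
  (:objects a b)
  (:init (at a) (adjacent a b))
  (:goal (at b)))
"""
```

Because both problem files point at the same `(:domain grid-nav)`, a system that gets the domain file right once can, in principle, handle any new instance in that environment by generating only a fresh problem file.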

Training & Generalization

To enable effective generalization, the researchers carefully designed a modest amount of training data for SimVLM so the model learned to understand the problem and goal without memorizing specific patterns. In tests, SimVLM successfully:

  • Described the scenario.
  • Simulated actions.
  • Detected whether the goal was reached.

…in about 85 % of experiments.

Performance

  • Six 2‑D planning tasks: ≈ 60 % success
  • Two 3‑D tasks (multirobot collaboration, robotic assembly): > 80 % success
  • Unseen scenarios (valid plans generated): > 50 %

These results far outpace baseline methods, which struggled to exceed 30 % success on comparable benchmarks.

“Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual‑based planning problems,” Fan adds.


Future Directions

The team plans to:

  • Extend VLMFP to handle more complex scenarios.
  • Develop methods to identify and mitigate hallucinations by the VLMs.

“In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual‑based planning into the picture, this work is an important piece of the puzzle,” Fan says.

This work was funded, in part, by the MIT‑IBM Watson AI Lab.
