[Paper] SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Source: arXiv - 2512.05955v1
Overview
The paper introduces SIMPACT, a test‑time framework that plugs a physics simulator into large vision‑language models (VLMs) so they can reason about how objects will move when acted upon. By turning a single RGB‑D snapshot into a lightweight simulation, the system lets the VLM “try out” actions, watch the simulated outcome, and iteratively improve its plan—without any extra training. This bridges the gap between the strong semantic knowledge of VLMs and the missing physical intuition needed for real‑world robotic manipulation.
Key Contributions
- Simulation‑in‑the‑loop reasoning: Enables off‑the‑shelf VLMs to query a physics engine at test time, turning static visual understanding into dynamic, causal reasoning.
- One‑shot world modeling: Constructs a compact physics simulation (rigid‑body + deformable) from a single RGB‑D observation, requiring no pre‑collected dynamics data.
- Iterative action refinement: The VLM proposes an action, observes the simulated rollout, and can revise its plan in a closed‑loop fashion.
- Zero‑training adaptation: No fine‑tuning of the VLM is needed; the simulation acts as an external knowledge source.
- State‑of‑the‑art results: Achieves top performance on five real‑world manipulation benchmarks (both rigid and deformable objects), surpassing existing general‑purpose robotic models.
Methodology
1. Perception → Simulation:
   - Capture an RGB‑D frame of the scene.
   - Use off‑the‑shelf depth processing to segment objects, estimate poses, and infer basic physical properties (mass, friction) from visual cues.
   - Populate a lightweight physics engine (e.g., PyBullet) with these objects, creating a “digital twin” of the tabletop scene.
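   A minimal sketch of this scene‑building step, assuming the perception stage yields a list of per‑object records (mesh path, pose, mass, friction). The record format and function names are illustrative assumptions, the sketch covers rigid bodies only, and the paper's actual reconstruction pipeline may differ:

   ```python
   import pybullet as p
   import pybullet_data

   def build_digital_twin(objects):
       """Populate a PyBullet scene from (assumed) perception output.

       `objects` is assumed to be a list of dicts with keys:
       "mesh_path", "position", "orientation" (quaternion), "mass", "friction".
       """
       client = p.connect(p.DIRECT)                        # headless physics server
       p.setAdditionalSearchPath(pybullet_data.getDataPath())
       p.setGravity(0, 0, -9.81)
       p.loadURDF("plane.urdf")                            # tabletop proxy

       body_ids = {}
       for i, obj in enumerate(objects):
           col = p.createCollisionShape(p.GEOM_MESH, fileName=obj["mesh_path"])
           vis = p.createVisualShape(p.GEOM_MESH, fileName=obj["mesh_path"])
           body = p.createMultiBody(
               baseMass=obj["mass"],
               baseCollisionShapeIndex=col,
               baseVisualShapeIndex=vis,
               basePosition=obj["position"],
               baseOrientation=obj["orientation"],
           )
           p.changeDynamics(body, -1, lateralFriction=obj["friction"])
           body_ids[f"object_{i}"] = body
       return client, body_ids
   ```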
2. Language‑Driven Planning:
   - Feed the original image and a natural‑language task description (e.g., “stack the blue block on the red one”) to a pre‑trained VLM (such as GPT‑4V or LLaVA).
   - The VLM outputs a high‑level action specification (grasp pose, push direction, etc.).
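   A sketch of the planning query, assuming a hypothetical `query_vlm(image, prompt)` wrapper around whichever VLM is used; the prompt wording and JSON action schema are illustrative assumptions, not the paper's exact interface:

   ```python
   import json

   ACTION_PROMPT = """You control a tabletop robot.
   Task: {task}
   Propose one action as a JSON object with fields:
     "type": "grasp" or "push",
     "target": name of the object to act on,
     "grasp_pose" or "push_direction": a 3D vector in the camera frame.
   Return only the JSON object."""

   def propose_action(image, task, query_vlm):
       """Ask the VLM for a structured action specification."""
       prompt = ACTION_PROMPT.format(task=task)
       reply = query_vlm(image=image, prompt=prompt)  # hypothetical VLM wrapper
       return json.loads(reply)                       # e.g. {"type": "push", ...}
   ```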
3. Simulation Rollout:
   - Execute the proposed action inside the simulated world.
   - Record the resulting object trajectories and any contact events.
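   A minimal rollout sketch, reusing the PyBullet scene and the push‑style action format assumed above; the crude velocity‑based push and fixed step count are simplifying assumptions rather than the paper's actual rollout:

   ```python
   import pybullet as p

   def rollout(body_ids, action, steps=240):
       """Apply a proposed action in simulation and log the outcome."""
       target = body_ids[action["target"]]
       if action["type"] == "push":
           # Crude push: give the target an initial velocity along the push direction.
           p.resetBaseVelocity(target, linearVelocity=action["push_direction"])

       trajectory, contacts = [], []
       for _ in range(steps):                 # ~1 s at PyBullet's default 240 Hz
           p.stepSimulation()
           pos, orn = p.getBasePositionAndOrientation(target)
           trajectory.append((pos, orn))
           contacts.extend(p.getContactPoints(bodyA=target))
       return trajectory, contacts
   ```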
4. Iterative Feedback:
   - Present the simulated outcome (as images or state vectors) back to the VLM, prompting it to reason about success or failure.
   - The VLM can then suggest a refined action, and the loop repeats until a satisfactory plan emerges or a budget of simulations is exhausted.
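   The refinement loop can then be sketched as a propose–simulate–critique cycle over the helpers above; the feedback prompt, verdict schema, and stopping rule are illustrative assumptions, and simulator state handling between rollouts is omitted for brevity:

   ```python
   import json

   def plan_with_simulation(image, task, query_vlm, body_ids, budget=5):
       """Iteratively refine the proposed action using simulated feedback."""
       action = propose_action(image, task, query_vlm)
       for _ in range(budget):
           # NOTE: a real loop would reset simulator state (e.g. p.saveState /
           # p.restoreState) before each rollout; omitted here for brevity.
           trajectory, contacts = rollout(body_ids, action)
           feedback = query_vlm(
               image=image,
               prompt=(
                   f"Task: {task}\n"
                   f"Proposed action: {action}\n"
                   f"Simulated final pose of the target: {trajectory[-1]}\n"
                   f"Contact events observed: {len(contacts)}\n"
                   'Reply with JSON: {"success": true or false, "revised_action": <action or null>}'
               ),
           )
           verdict = json.loads(feedback)
           if verdict["success"]:
               break
           action = verdict["revised_action"]
       return action
   ```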
5. Execution on Real Robot:
   - The final, simulation‑validated action is transferred to the physical robot for execution.
The whole pipeline runs at test time, leveraging the VLM’s language reasoning while grounding it in physics‑based predictions.
Results & Findings
| Task | Rigid / Deformable | Success Rate (SIMPACT) | Prior Best |
|---|---|---|---|
| Block stacking | Rigid | 92 % | 78 % |
| Object insertion | Rigid | 88 % | 71 % |
| Cable routing | Deformable | 84 % | 60 % |
| Cloth folding | Deformable | 81 % | 65 % |
| Shape‑matching (mixed) | Both | 86 % | 73 % |
- SIMPACT consistently outperforms baseline VLM‑only planners and recent end‑to‑end manipulation networks.
- Ablation studies show that removing the simulation loop drops performance by ~15 % on average, confirming the critical role of physics grounding.
- The system requires only a few seconds of simulation per iteration, making it practical for on‑the‑fly planning.
Practical Implications
- Rapid prototyping of robot skills: Developers can reuse any existing VLM (e.g., GPT‑4V) and add a simulation wrapper to endow it with physical intuition, avoiding costly data collection or model retraining.
- General‑purpose home robots: Tasks like tidying up, arranging groceries, or handling soft items (clothes, cables) become feasible with a single visual snapshot and a natural‑language command.
- Simulation‑augmented AI assistants: Beyond robotics, any AI that must predict the outcome of physical actions (e.g., AR/VR assistants, digital twins for manufacturing) can adopt the same loop to improve safety and reliability.
- Reduced reliance on large‑scale interaction datasets: By leveraging physics engines at test time, companies can sidestep the massive effort of gathering millions of robot‑interaction logs.
Limitations & Future Work
- Physics fidelity vs. speed trade‑off: The current implementation uses simplified contact models; highly complex deformable dynamics (e.g., fluid‑like materials) may still be mis‑predicted.
- Perception errors: Inaccurate pose or property estimation from a single view can propagate into the simulation, leading to sub‑optimal plans. Multi‑view or active perception could mitigate this.
- Scalability to large scenes: Building a full‑scene simulation for cluttered environments remains computationally heavy; hierarchical or object‑centric simulations are a promising direction.
- Learning to query the simulator: Future work could train a lightweight policy that decides when to invoke simulation versus trusting the VLM’s intuition, further reducing latency.
Overall, SIMPACT demonstrates that embedding a physics engine into the reasoning loop of vision‑language models is a practical, training‑free pathway toward more physically aware AI agents.
Authors
- Haowen Liu
- Shaoxiong Yao
- Haonan Chen
- Jiawei Gao
- Jiayuan Mao
- Jia‑Bin Huang
- Yilun Du
Paper Information
- arXiv ID: 2512.05955v1
- Categories: cs.RO, cs.CV
- Published: December 5, 2025