[Paper] SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Source: arXiv - 2512.05955v1
Overview
The paper introduces SIMPACT, a test‑time framework that plugs a physics simulator into large vision‑language models (VLMs) so they can reason about how objects will move when acted upon. By turning a single RGB‑D snapshot into a lightweight simulation, the system lets the VLM “try out” actions, watch the simulated outcome, and iteratively improve its plan—without any extra training. This bridges the gap between the strong semantic knowledge of VLMs and the missing physical intuition needed for real‑world robotic manipulation.
Key Contributions
- Simulation‑in‑the‑loop reasoning: Enables off‑the‑shelf VLMs to query a physics engine at test time, turning static visual understanding into dynamic, causal reasoning.
- One‑shot world modeling: Constructs a compact physics simulation (rigid‑body + deformable) from a single RGB‑D observation, requiring no pre‑collected dynamics data.
- Iterative action refinement: The VLM proposes an action, observes the simulated rollout, and can revise its plan in a closed‑loop fashion.
- Zero‑training adaptation: No fine‑tuning of the VLM is needed; the simulation acts as an external knowledge source.
- State‑of‑the‑art results: Achieves top performance on five real‑world manipulation benchmarks (both rigid and deformable objects), surpassing existing general‑purpose robotic models.
Methodology
1. Perception → Simulation:
   - Capture an RGB‑D frame of the scene.
   - Use off‑the‑shelf depth processing to segment objects, estimate poses, and infer basic physical properties (mass, friction) from visual cues.
   - Populate a lightweight physics engine (e.g., PyBullet) with these objects, creating a “digital twin” of the tabletop scene.
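   A minimal sketch of this scene‑building step, assuming the perception stage yields a list of per‑object records (mesh path, pose, mass, friction). The record format and function names are illustrative assumptions, the sketch covers rigid bodies only, and the paper's actual reconstruction pipeline may differ:

   ```python
   import pybullet as p
   import pybullet_data

   def build_digital_twin(objects):
       """Populate a PyBullet scene from (assumed) perception output.

       `objects` is assumed to be a list of dicts with keys:
       "mesh_path", "position", "orientation" (quaternion), "mass", "friction".
       """
       client = p.connect(p.DIRECT)                        # headless physics server
       p.setAdditionalSearchPath(pybullet_data.getDataPath())
       p.setGravity(0, 0, -9.81)
       p.loadURDF("plane.urdf")                            # tabletop proxy

       body_ids = {}
       for i, obj in enumerate(objects):
           col = p.createCollisionShape(p.GEOM_MESH, fileName=obj["mesh_path"])
           vis = p.createVisualShape(p.GEOM_MESH, fileName=obj["mesh_path"])
           body = p.createMultiBody(
               baseMass=obj["mass"],
               baseCollisionShapeIndex=col,
               baseVisualShapeIndex=vis,
               basePosition=obj["position"],
               baseOrientation=obj["orientation"],
           )
           p.changeDynamics(body, -1, lateralFriction=obj["friction"])
           body_ids[f"object_{i}"] = body
       return client, body_ids
   ```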
2. Language‑Driven Planning:
   - Feed the original image and a natural‑language task description (e.g., “stack the blue block on the red one”) to a pre‑trained VLM (such as GPT‑4V or LLaVA).
   - The VLM outputs a high‑level action specification (grasp pose, push direction, etc.).
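   A sketch of the planning query, assuming a hypothetical `query_vlm(image, prompt)` wrapper around whichever VLM is used; the prompt wording and JSON action schema are illustrative assumptions, not the paper's exact interface:

   ```python
   import json

   ACTION_PROMPT = """You control a tabletop robot.
   Task: {task}
   Propose one action as a JSON object with fields:
     "type": "grasp" or "push",
     "target": name of the object to act on,
     "grasp_pose" or "push_direction": a 3D vector in the camera frame.
   Return only the JSON object."""

   def propose_action(image, task, query_vlm):
       """Ask the VLM for a structured action specification."""
       prompt = ACTION_PROMPT.format(task=task)
       reply = query_vlm(image=image, prompt=prompt)  # hypothetical VLM wrapper
       return json.loads(reply)                       # e.g. {"type": "push", ...}
   ```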
3. Simulation Rollout:
   - Execute the proposed action inside the simulated world.
   - Record the resulting object trajectories and any contact events.
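   A minimal rollout sketch, reusing the PyBullet scene and the push‑style action format assumed above; the crude velocity‑based push and fixed step count are simplifying assumptions rather than the paper's actual rollout:

   ```python
   import pybullet as p

   def rollout(body_ids, action, steps=240):
       """Apply a proposed action in simulation and log the outcome."""
       target = body_ids[action["target"]]
       if action["type"] == "push":
           # Crude push: give the target an initial velocity along the push direction.
           p.resetBaseVelocity(target, linearVelocity=action["push_direction"])

       trajectory, contacts = [], []
       for _ in range(steps):                 # ~1 s at PyBullet's default 240 Hz
           p.stepSimulation()
           pos, orn = p.getBasePositionAndOrientation(target)
           trajectory.append((pos, orn))
           contacts.extend(p.getContactPoints(bodyA=target))
       return trajectory, contacts
   ```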
4. Iterative Feedback:
   - Present the simulated outcome (as images or state vectors) back to the VLM, prompting it to reason about success or failure.
   - The VLM can then suggest a refined action, and the loop repeats until a satisfactory plan emerges or a budget of simulations is exhausted.
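   The refinement loop can then be sketched as a propose–simulate–critique cycle over the helpers above; the feedback prompt, verdict schema, and stopping rule are illustrative assumptions, and simulator state handling between rollouts is omitted for brevity:

   ```python
   import json

   def plan_with_simulation(image, task, query_vlm, body_ids, budget=5):
       """Iteratively refine the proposed action using simulated feedback."""
       action = propose_action(image, task, query_vlm)
       for _ in range(budget):
           # NOTE: a real loop would reset simulator state (e.g. p.saveState /
           # p.restoreState) before each rollout; omitted here for brevity.
           trajectory, contacts = rollout(body_ids, action)
           feedback = query_vlm(
               image=image,
               prompt=(
                   f"Task: {task}\n"
                   f"Proposed action: {action}\n"
                   f"Simulated final pose of the target: {trajectory[-1]}\n"
                   f"Contact events observed: {len(contacts)}\n"
                   'Reply with JSON: {"success": true or false, "revised_action": <action or null>}'
               ),
           )
           verdict = json.loads(feedback)
           if verdict["success"]:
               break
           action = verdict["revised_action"]
       return action
   ```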
5. Execution on Real Robot:
   - The final, simulation‑validated action is transferred to the physical robot for execution.
The whole pipeline runs at test time, leveraging the VLM’s language reasoning while grounding it in physics‑based predictions.
Results & Findings
| Task | Rigid / Deformable | Success Rate (SIMPACT) | Prior Best |
|---|---|---|---|
| Block stacking | Rigid | 92 % | 78 % |
| Object insertion | Rigid | 88 % | 71 % |
| Cable routing | Deformable | 84 % | 60 % |
| Cloth folding | Deformable | 81 % | 65 % |
| Shape‑matching (mixed) | Both | 86 % | 73 % |
- SIMPACT consistently outperforms baseline VLM‑only planners and recent end‑to‑end manipulation networks.
- Ablation studies show that removing the simulation loop drops performance by ~15 % on average, confirming the critical role of physics grounding.
- The system requires only a few seconds of simulation per iteration, making it practical for on‑the‑fly planning.
Practical Implications
- Rapid prototyping of robot skills: Developers can reuse any existing VLM (e.g., GPT‑4V) and add a simulation wrapper to endow it with physical intuition, avoiding costly data collection or model retraining.
- General‑purpose home robots: Tasks like tidying up, arranging groceries, or handling soft items (clothes, cables) become feasible with a single visual snapshot and a natural‑language command.
- Simulation‑augmented AI assistants: Beyond robotics, any AI that must predict the outcome of physical actions (e.g., AR/VR assistants, digital twins for manufacturing) can adopt the same loop to improve safety and reliability.
- Reduced reliance on large‑scale interaction datasets: By leveraging physics engines at test time, companies can sidestep the massive effort of gathering millions of robot‑interaction logs.
Limitations & Future Work
- Physics fidelity vs. speed trade‑off: The current implementation uses simplified contact models; highly complex deformable dynamics (e.g., fluid‑like materials) may still be mis‑predicted.
- Perception errors: Inaccurate pose or property estimation from a single view can propagate into the simulation, leading to sub‑optimal plans. Multi‑view or active perception could mitigate this.
- Scalability to large scenes: Building a full‑scene simulation for cluttered environments remains computationally heavy; hierarchical or object‑centric simulations are a promising direction.
- Learning to query the simulator: Future work could train a lightweight policy that decides when to invoke simulation versus trusting the VLM’s intuition, further reducing latency.
Overall, SIMPACT demonstrates that embedding a physics engine into the reasoning loop of vision‑language models is a practical, training‑free pathway toward more physically aware AI agents.
Authors
- Haowen Liu
- Shaoxiong Yao
- Haonan Chen
- Jiawei Gao
- Jiayuan Mao
- Jia‑Bin Huang
- Yilun Du
Paper Information
- arXiv ID: 2512.05955v1
- Categories: cs.RO, cs.CV
- Published: December 5, 2025