[Paper] Can vision language models learn intuitive physics from interaction?
Source: arXiv - 2602.06033v1
Overview
A recent study investigates whether large vision‑language models (VLMs) can acquire “intuitive physics”: the commonsense understanding of gravity, collisions, and object permanence that humans develop through everyday interaction. The authors test whether letting these models learn by actively interacting with a simulated environment (via reinforcement learning) yields more robust, transferable physical reasoning than standard supervised fine‑tuning.
Key Contributions
- Interaction‑based training pipeline: Introduces a reinforcement‑learning (RL) framework that lets pre‑trained VLMs act, observe, and receive feedback in a physics‑rich simulated world.
- Systematic generalization tests: Designs a suite of related physical tasks (e.g., stacking, rolling, catching) that share visual features but differ in dynamics, to probe cross‑task transfer.
- Empirical finding on robustness: Shows that while interaction improves performance on the specific task the model is trained on, it does not yield a model that generalizes its physical intuition to new, but related, scenarios.
- Baseline comparison: Benchmarks interaction‑trained VLMs against supervised fine‑tuned VLMs, confirming that neither approach achieves strong out‑of‑distribution physical reasoning.
Methodology
- Base model: The authors start from a state‑of‑the‑art vision‑language model (e.g., CLIP‑based encoder‑decoder) that already understands image–text pairs.
- Environment: A lightweight physics simulator (similar to Unity or MuJoCo) provides a set of tasks where an agent must predict or manipulate object trajectories (e.g., “Will the ball fall off the platform?”).
- Learning via RL:
- The VLM receives an image of the scene and a textual prompt.
- It outputs an action (e.g., “push left”, “wait”).
- The simulator returns a reward based on whether the physical prediction was correct or the manipulation succeeded.
- Policy gradients (PPO) update the VLM’s parameters, allowing it to refine its internal representation of physics through trial‑and‑error.
- Evaluation protocol: After training on a single task, the same model is tested on three held‑out tasks that share the same visual statistics but require different physical reasoning. Performance is measured both as raw accuracy and as the ability to predict future states.
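The act‑observe‑reward loop described above can be sketched as follows. This is a toy stand‑in, not the authors' actual setup: `ToyPhysicsEnv`, `centering_policy`, and all the numbers are invented for illustration, the "policy" is a hand‑written rule in place of the VLM, and the PPO update itself is only indicated by a comment.

```python
import random

class ToyPhysicsEnv:
    """Toy stand-in for the physics simulator: keep a ball on a platform."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.pos = 0.0

    def reset(self):
        # Observation: signed distance of the ball from the platform
        # center; the platform spans [-1, 1].
        self.pos = self.rng.uniform(-1.0, 1.0)
        return self.pos

    def step(self, action):
        # The action nudges the ball, then a random drift perturbs it.
        if action == "push left":
            self.pos -= 0.3
        elif action == "push right":
            self.pos += 0.3
        self.pos += self.rng.uniform(-0.5, 0.5)
        reward = 1.0 if abs(self.pos) < 1.0 else 0.0  # stayed on platform?
        return self.pos, reward

def centering_policy(obs):
    # Placeholder for the VLM: in the paper, an image plus a textual
    # prompt goes in and a textual action comes out.
    if obs > 0.3:
        return "push left"
    if obs < -0.3:
        return "push right"
    return "wait"

def rollout(env, policy, episodes=200):
    """Average reward over one-step episodes (the act-observe-reward loop)."""
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        _, reward = env.step(policy(obs))
        total += reward
        # PPO would use the collected (obs, action, reward) tuples here
        # to update the VLM's parameters; omitted in this sketch.
    return total / episodes

print(rollout(ToyPhysicsEnv(seed=0), centering_policy))
```

In the paper's actual pipeline, the scalar reward comes from whether the physical prediction or manipulation succeeded, and the policy‑gradient update flows back through the VLM's weights rather than a fixed rule.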
Results & Findings
- Within‑task gains: Interaction‑trained VLMs improve from ~55 % to ~78 % accuracy on the task they were trained on, outperforming supervised fine‑tuning (~70 %).
- Cross‑task drop‑off: When evaluated on a new task, accuracy falls back to ~52 %, essentially matching the unadapted pre‑trained baseline.
- No clear benefit from interaction: Even when the training and test tasks share the same underlying physics (e.g., gravity) and visual layout, the learned policies do not transfer.
- Analysis of representations: Probing the hidden layers shows that interaction reshapes some visual features but does not produce a unified, abstract physics module.
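The representation analysis above is typically done with linear probes: fit a linear readout from frozen hidden features to a physical quantity and compare how well each layer decodes it. A minimal sketch, with synthetic features in place of the VLM's real activations (the target variable and both "layers" here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_r2(features, target):
    """R^2 of a least-squares linear probe (train == test, for brevity)."""
    X = np.column_stack([features, np.ones(len(features))])  # add bias term
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    pred = X @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic example: "layer A" linearly encodes the physical quantity
# (e.g. time-to-contact), "layer B" carries no decodable signal.
target = rng.normal(size=500)
layer_a = np.column_stack([target, rng.normal(size=(500, 7))])
layer_b = rng.normal(size=(500, 8))

print(probe_r2(layer_a, target))  # near 1.0: quantity is decodable
print(probe_r2(layer_b, target))  # near 0.0: no linear signal
```

A finding like the paper's would look like high probe accuracy for task‑specific visual features but no layer whose features decode physical quantities across all tasks.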
Practical Implications
- Caution for developers: Simply fine‑tuning a VLM with RL in a simulated physics environment is unlikely to give you a model that can reliably reason about unseen physical scenarios (e.g., robotics planning, AR/VR object interactions).
- Need for dedicated physics modules: Companies building embodied AI (robots, autonomous drones) may need to integrate explicit physics engines or specialized simulation‑trained models rather than relying on VLMs alone.
- Dataset design insight: To achieve transferable intuition, training data must expose models to a variety of physical contexts, not just a single task, suggesting multi‑task curricula or meta‑learning approaches.
- Potential for hybrid systems: The study hints that VLMs excel at perception and language grounding, while separate, perhaps symbolic or graph‑based physics solvers could handle dynamics, opening avenues for modular AI pipelines.
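The modular split suggested in the last bullet can be sketched as two stages: a perception stage (a stub standing in for the VLM) that grounds a scene into object state, and an explicit physics module that handles dynamics. All names, numbers, and the one‑dimensional "will the ball fall off?" task are invented for illustration:

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2

@dataclass
class Ball:
    x: float   # horizontal position (m)
    y: float   # height above the ground (m)
    vx: float  # horizontal velocity (m/s)
    vy: float  # vertical velocity (m/s)

def perceive(scene_description: str) -> Ball:
    """Stub for the VLM perception stage: scene -> object state.
    A real system would ground this from pixels; we hardcode a ball."""
    return Ball(x=0.0, y=1.0, vx=0.8, vy=0.0)

def simulate(ball: Ball, platform_edge: float, dt=0.01, t_max=5.0) -> bool:
    """Explicit physics module: does the ball roll off the platform edge
    and hit the ground within t_max seconds? (Forward Euler integration.)"""
    t = 0.0
    while t < t_max:
        ball.x += ball.vx * dt
        if ball.x > platform_edge:   # past the edge: free fall under gravity
            ball.vy -= G * dt
            ball.y += ball.vy * dt
            if ball.y <= 0.0:
                return True          # hit the ground: it fell off
        t += dt
    return False

ball = perceive("a ball rolling toward the edge of a table")
print(simulate(ball, platform_edge=1.0))  # True: the ball reaches the edge
```

The design point is the interface: perception produces a symbolic state, so the dynamics module can be swapped (Euler integration, a full engine like MuJoCo, or a learned graph‑based solver) without retraining the perception model.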
Limitations & Future Work
- Limited task diversity: The experiments focus on a handful of relatively simple physics tasks; more complex, multi‑object interactions might reveal different patterns.
- Simulation‑only setting: Real‑world noise (friction variability, sensor errors) is absent, so findings may not directly translate to physical robots.
- Model size & architecture: Only one class of VLMs was examined; larger or multimodal transformers (e.g., Flamingo, GPT‑4‑V) could behave differently.
- Future directions suggested:
- Multi‑task or meta‑RL curricula that explicitly encourage abstraction across physics domains.
- Incorporating structured physical priors (e.g., graph neural networks) into the VLM’s latent space.
- Evaluating transfer to real‑world robotic platforms to test whether simulated interaction can bridge the reality gap.
Authors
- Luca M. Schulze Buschoff
- Konstantinos Voudouris
- Can Demircan
- Eric Schulz
Paper Information
- arXiv ID: 2602.06033v1
- Categories: cs.LG
- Published: February 5, 2026