[Paper] NewtPhys: Do Foundation Models Understand Newtonian Physics?
Source: arXiv - 2606.03986v1
Overview
The paper “NewtPhys: Do Foundation Models Understand Newtonian Physics?” introduces a dataset of real‑world, multi‑view video scenes paired with dense, physics‑grounded annotations (forces, motion, semantics, geometry). By evaluating 56 open‑weight vision‑language models (VLMs), 2 closed‑source frontier models, and 10 vision‑foundation models (VFMs) on this data, the authors show that current models still struggle with low‑level Newtonian reasoning despite strong performance on synthetic benchmarks.
Key Contributions
- NewtPhys dataset: a 4‑D (space‑time) collection of real‑world multi‑view images paired with fine‑grained, per‑pixel physics annotations (3‑D forces, object velocities, contact maps, amodal masks, semantic labels, and geometry).
- Comprehensive benchmark: systematic evaluation of 56 open‑weight VLMs and 2 closed‑source frontier models, plus 10 VFMs, on low‑level physics tasks (force prediction, trajectory extrapolation, contact reasoning).
- Diagnostic analysis: detailed breakdown of where models succeed or fail (e.g., static object recognition vs. dynamic force inference).
- Open‑source release: code, data, and evaluation scripts are publicly available, encouraging community‑driven research on physics‑aware vision.
- Roadmap for future work: outlines how NewtPhys can serve as a testbed for training physics‑grounded models and for designing new evaluation protocols.
Methodology
- Data Capture: Real‑world tabletop scenes were recorded with multiple calibrated cameras, producing synchronized multi‑view video streams.
- Physics Simulation Overlay: Using a high‑fidelity physics engine, the authors simulated the exact same scenes, extracting ground‑truth quantities such as forces, contact normals, and object trajectories at each timestep.
- Dense Annotation Pipeline: The simulated data were projected back onto the captured images, yielding per‑pixel, amodal masks and continuous physics fields (e.g., force vectors) aligned with the real visual content.
- Benchmark Design: Tasks were formulated as vision‑language prompts (e.g., “What force is acting on the red block at t = 0.5 s?”) and pure vision tasks (e.g., predict the future velocity field).
- Model Evaluation: Each model’s output was compared against the ground‑truth annotations using metrics such as mean angular error for force direction, endpoint error for velocity fields, and IoU for amodal masks.
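The three metrics named in the Model Evaluation step have standard definitions, sketched below with NumPy. This is a minimal illustration, not the authors' evaluation code: the function names, array shapes, and the handling of zero-length vectors and empty masks are all assumptions made for the example.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, eps=1e-8):
    """Mean angle in degrees between predicted and ground-truth vectors (rows)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dot = np.sum(pred * gt, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    # Assumption: a zero-length vector yields cos = 0, i.e. a 90 degree error.
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def endpoint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth vectors (rows)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def mask_iou(pred, gt):
    """Intersection over union of two boolean masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # Assumption: two empty masks count as a perfect match.
    return float(np.logical_and(pred, gt).sum() / union)

# Toy inputs (made up for illustration, not taken from the dataset).
force_pred = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
force_gt = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel_pred = [[1.0, 0.0], [0.0, 0.0]]
vel_gt = [[1.0, 0.0], [3.0, 4.0]]
mask_pred = [[1, 1], [0, 0]]
mask_gt = [[1, 0], [1, 0]]

print(mean_angular_error_deg(force_pred, force_gt))  # angles of 0 and 90 degrees
print(endpoint_error(vel_pred, vel_gt))              # distances of 0 and 5
print(mask_iou(mask_pred, mask_gt))                  # 1 shared pixel out of 3
```

Angular error ignores magnitude entirely, which is why a direction metric alone cannot tell whether a model has inferred how strong a force is; endpoint error penalizes both direction and magnitude.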
Results & Findings
- Overall performance lag: Even the largest open‑weight VLMs (e.g., LLaVA‑13B, MiniGPT‑4) achieved only ~30‑40 % accuracy on force‑direction queries, far below the human baseline (~95 %).
- Closed‑source frontier models: GPT‑4V and Gemini showed modest improvements (~45 % accuracy) but still failed on nuanced contact reasoning.
- Vision‑only models: VFMs excelled at static geometry (high IoU for amodal masks) but performed poorly on dynamic quantities (high endpoint error for velocity).
- Error patterns: Models reliably identified object categories and rough motion direction but could not infer magnitude or vector composition of forces, indicating a lack of true Newtonian understanding.
- Cross‑modal gap: Adding language prompts helped marginally, suggesting that current VLMs do not effectively fuse visual dynamics with textual reasoning for physics tasks.
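To make "vector composition of forces" concrete, the sketch below sums the individual forces on a single body and applies Newton's second law. The scenario (a 2 kg block pushed across a table) and every numeric value are hypothetical and chosen only for illustration; they do not come from the paper or the dataset.

```python
import numpy as np

def net_force(forces):
    """Vector sum of all forces on one body; each row is one force in newtons."""
    return np.asarray(forces, dtype=float).sum(axis=0)

# Hypothetical scene: a 2 kg block pushed along +x on a flat table (z is up).
mass = 2.0   # kg
g = 9.81     # m/s^2
gravity = np.array([0.0, 0.0, -mass * g])
normal = np.array([0.0, 0.0, mass * g])    # table pushes back, cancelling gravity
push = np.array([5.0, 0.0, 0.0])
friction = np.array([-1.0, 0.0, 0.0])      # opposes the direction of motion

total = net_force([gravity, normal, push, friction])
acceleration = total / mass                # Newton's second law: a = F_net / m

print(total)         # net force along +x only
print(acceleration)  # resulting acceleration along +x
```

Answering such a query correctly requires identifying every contributing force, its direction, and its magnitude before summing, which is the step the error analysis above reports models failing at.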
Practical Implications
- Robotics & AR/VR: Systems that need to predict physical interactions (e.g., robot grasp planning, AR object placement) cannot yet rely on off‑the‑shelf foundation models for accurate force estimation.
- Simulation‑to‑Reality Transfer: Developers using synthetic data to pre‑train models should be aware that performance does not automatically transfer to real‑world physics reasoning.
- Safety‑critical AI: Applications such as autonomous driving or industrial inspection that must anticipate physical consequences (e.g., object falling) will need dedicated physics modules rather than generic VLMs.
- Tooling for Developers: The open dataset and evaluation scripts enable rapid prototyping of physics‑aware perception pipelines, encouraging integration of differentiable physics engines with deep vision models.
Limitations & Future Work
- Scene Scope: NewtPhys focuses on tabletop, rigid‑body interactions; extending to deformable objects, fluids, or larger‑scale environments remains open.
- Annotation Fidelity: Although the physics simulations are high‑fidelity, any mismatch between simulated and actual material properties (e.g., friction or mass) could introduce noise into the annotations.
- Model Diversity: The benchmark covers primarily image‑based VLMs; video‑foundation models and multimodal transformers trained on longer temporal windows were not evaluated.
- Future Directions: The authors suggest augmenting the dataset with active manipulation data, exploring self‑supervised physics pre‑training, and developing evaluation metrics that better capture causal reasoning.
NewtPhys shines a light on a blind spot in today’s foundation models: the ability to truly “understand” Newtonian physics. For developers building systems that interact with the physical world, this work is a call to integrate dedicated physics reasoning components, or to train next‑generation models that can bridge the gap between visual perception and the laws that govern motion.
Authors
- Sebastian Cavada
- Soumava Paul
- Tuan-Hung Vu
- Andrei Bursuc
- Raoul de Charette
Paper Information
- arXiv ID: 2606.03986v1
- Categories: cs.CV
- Published: June 2, 2026