[Paper] Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
Source: arXiv - 2606.09646v1
Overview
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
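The temporal control mentioned above (disrupting frame order) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the per-frame feature layout (a list of per-frame vectors), and the two toy readouts are assumptions chosen for illustration. It contrasts an order-invariant readout (mean pooling) with an order-sensitive one (average change between consecutive frames).

```python
import random

def reorder_frames(frame_feats, order):
    """Return the per-frame features rearranged according to `order`."""
    return [frame_feats[i] for i in order]

def shuffle_frames(frame_feats, seed=0):
    """Temporal control: randomly permute frame order (hypothetical helper)."""
    order = list(range(len(frame_feats)))
    random.Random(seed).shuffle(order)
    return reorder_frames(frame_feats, order)

def mean_pool(frame_feats):
    """Order-invariant readout: average each feature dimension over time."""
    n = len(frame_feats)
    return [sum(col) / n for col in zip(*frame_feats)]

def mean_frame_delta(frame_feats):
    """Order-sensitive readout: average change between consecutive frames."""
    steps = list(zip(frame_feats[:-1], frame_feats[1:]))
    dims = range(len(frame_feats[0]))
    return [sum(b[d] - a[d] for a, b in steps) / len(steps) for d in dims]

# Toy clip: 4 frames, 1 feature that increases steadily over time.
clip = [[0.0], [1.0], [2.0], [3.0]]
reversed_clip = reorder_frames(clip, [3, 2, 1, 0])
```

In this toy setting, `mean_pool` gives the same vector for `clip` and `reversed_clip`, while `mean_frame_delta` flips sign. A frame-order control can therefore only affect readouts that depend on temporal structure.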
Key Contributions
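The abstract reports the following contributions:
- A frozen-feature probing comparison of V-JEPA (predictive joint-embedding), VideoMAE (masked reconstruction), and LTX-Video (diffusion-based video generation) on IntPhys2 and Minimal Video Pairs (MVP).
- A layerwise analysis showing that physics-relevant information is weakest in early layers and most accessible at intermediate-to-late depth.
- Temporal controls showing that disrupting frame order substantially reduces performance, especially on MVP.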
Methodology
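The study trains probes on frozen features from each model, evaluates them on IntPhys2 and MVP, and repeats the analysis across layers and probe types, with temporal controls that disrupt frame order. Please refer to the full paper for detailed methodology.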
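Frozen-feature layerwise probing, as described in the overview, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the probe is a plain logistic regression trained with SGD, the layer names are hypothetical, the binary labels (e.g. 0 for an implausible clip, 1 for a plausible one) are an assumed encoding, and the features are expected to be precomputed from a backbone whose weights are never updated.

```python
import math

def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain SGD on frozen features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # keep exp() numerically safe
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def probe_accuracy(probe, features, labels):
    """Fraction of examples the probe classifies correctly."""
    w, b = probe
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)

def probe_each_layer(train_feats, train_labels, test_feats, test_labels):
    """Train one probe per layer and report held-out accuracy per layer.

    `train_feats` and `test_feats` map a layer name (hypothetical) to a
    list of feature vectors extracted from that layer.
    """
    results = {}
    for layer in train_feats:
        probe = train_linear_probe(train_feats[layer], train_labels)
        results[layer] = probe_accuracy(probe, test_feats[layer], test_labels)
    return results
```

Comparing the per-layer accuracies returned by `probe_each_layer` is one simple way to see at which depth a property becomes linearly accessible. Only the probe parameters `w` and `b` are trained, so differences between layers reflect the features rather than any fine-tuning of the backbone.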
Practical Implications
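The results suggest that how well intuitive-physics information can be read out of a pretrained video model depends on the pretraining paradigm, the layer probed, and the readout mechanism. Evaluations that rely on a single layer or probe type may therefore understate what a model encodes.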
Authors
- Samuele Punzo
- Niccolò Caselli
- Ippokratis Pantelidis
- Francesco Massafra
- Salvatore Lo Sardo
- Mohammadreza Salehi
Paper Information
- arXiv ID: 2606.09646v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: June 8, 2026