[Paper] Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

Published: 3 days ago (June 8, 2026 at 11:40 AM EDT)

Source: arXiv - 2606.09646v1

Overview

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
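The probing setup described above can be illustrated with a toy sketch. It does not reproduce the paper's code, models, or benchmarks. It assumes synthetic per-frame "features" in place of real frozen activations, a difference-of-class-means linear readout in place of whatever probes the authors trained, and hypothetical helper names (`make_clips`, `shuffle_frames`, `mean_pool`, `flatten_time`, `fit_probe`). The label is encoded only in the direction of a temporal ramp, so a time-averaged readout should stay near chance, an order-aware readout should recover the label, and shuffling frame order should remove the signal.

```python
import random


def make_clips(n, n_frames=8, dim=4, noise=0.3, seed=0):
    """Synthetic stand-in for frozen features: one list of frames per clip.

    Assumption for illustration only: the label decides whether feature 0
    ramps up (label 1) or down (label 0) over time.
    """
    rng = random.Random(seed)
    clips, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        clip = []
        for t in range(n_frames):
            ramp = -1.0 + 2.0 * t / (n_frames - 1)
            frame = [rng.gauss(0.0, noise) for _ in range(dim)]
            frame[0] += sign * ramp
            clip.append(frame)
        clips.append(clip)
        labels.append(label)
    return clips, labels


def shuffle_frames(clips, seed=0):
    """Temporal control: permute frame order independently for each clip."""
    rng = random.Random(seed)
    return [rng.sample(clip, len(clip)) for clip in clips]


def mean_pool(clip):
    """Order-agnostic readout input: average the frames over time."""
    return [sum(f[d] for f in clip) / len(clip) for d in range(len(clip[0]))]


def flatten_time(clip):
    """Order-aware readout input: concatenate frames in temporal order."""
    return [value for frame in clip for value in frame]


def fit_probe(xs, labels):
    """Linear probe: weights are the difference of the two class means."""
    dim = len(xs[0])
    means = []
    for c in (0, 1):
        rows = [x for x, y in zip(xs, labels) if y == c]
        means.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    weights = [m1 - m0 for m0, m1 in zip(means[0], means[1])]
    midpoint = [(m0 + m1) / 2.0 for m0, m1 in zip(means[0], means[1])]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def accuracy(probe, xs, labels):
    weights, bias = probe
    hits = 0
    for x, y in zip(xs, labels):
        score = sum(w * v for w, v in zip(weights, x)) + bias
        hits += int((score > 0) == (y == 1))
    return hits / len(labels)


def evaluate(readout, train, test):
    """Fit the probe on train features and report held-out accuracy."""
    probe = fit_probe([readout(c) for c in train[0]], train[1])
    return accuracy(probe, [readout(c) for c in test[0]], test[1])


train = make_clips(500, seed=0)
test = make_clips(500, seed=1)

acc_pooled = evaluate(mean_pool, train, test)
acc_ordered = evaluate(flatten_time, train, test)

train_shuffled = (shuffle_frames(train[0], seed=2), train[1])
test_shuffled = (shuffle_frames(test[0], seed=3), test[1])
acc_shuffled = evaluate(flatten_time, train_shuffled, test_shuffled)

print("mean-pooled probe:", acc_pooled)
print("order-aware probe:", acc_ordered)
print("order-aware probe, shuffled frames:", acc_shuffled)
```

With real models, the same comparison would be repeated on features extracted from each layer of the frozen network to obtain a layerwise profile. The feature extraction step is omitted here because it depends on the specific model.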

Key Contributions

This paper is filed under the following arXiv subject areas:

cs.CV (Computer Vision and Pattern Recognition)
cs.AI (Artificial Intelligence)
cs.LG (Machine Learning)

Methodology

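The study probes frozen features from V-JEPA, VideoMAE, and LTX-Video on IntPhys2 and Minimal Video Pairs (MVP), with layerwise analyses and temporal controls that disrupt frame order. Please refer to the full paper for the detailed methodology.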

Practical Implications

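The results suggest that how much intuitive-physics information can be read out of a pretrained video model depends on the pretraining paradigm, the layer being probed, and the probe used, so all three should be considered when evaluating such models.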

Authors

Samuele Punzo
Niccolò Caselli
Ippokratis Pantelidis
Francesco Massafra
Salvatore Lo Sardo
Mohammadreza Salehi

Paper Information

arXiv ID: 2606.09646v1
Categories: cs.CV, cs.AI, cs.LG
Published: June 8, 2026

Related posts

[Paper] Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

[Paper] DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

[Paper] Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

[Paper] Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy
