[Paper] Representation Learning for Spatiotemporal Physical Systems

Published: March 13, 2026 at 01:59 PM EDT

Source: arXiv - 2603.13227v1

Overview

This paper investigates how well modern self‑supervised learning (SSL) techniques can capture the underlying physics of spatiotemporal systems—think fluid flows, weather patterns, or particle simulations. Instead of stopping at the usual “predict the next video frame” task, the authors ask whether the learned representations are useful for downstream scientific questions such as estimating hidden physical parameters.

Key Contributions

Shifted evaluation paradigm: Introduces downstream scientific tasks (parameter estimation) as a more meaningful benchmark for representation quality than raw next‑frame prediction.
Comprehensive comparison: Empirically compares generic SSL methods (e.g., SimCLR, MoCo, masked autoencoders), joint embedding predictive architectures (JEPAs), and physics‑oriented next‑frame predictors across several simulated physical datasets.
Latent‑space advantage: Shows that methods learning in a latent embedding space consistently outperform pixel‑level predictive models for downstream tasks.
Open‑source toolkit: Releases a well‑documented codebase (github.com/helenqu/physical-representation-learning) that reproduces experiments and can be extended to new physical domains.
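
To make the latent-space versus pixel-level distinction concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the paper's implementation: the encoder is a fixed, hand-written averaging map (`ENCODER_WEIGHTS`), whereas a JEPA-style model learns its encoder and needs extra machinery to keep the embeddings from collapsing. All names and values are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(frame, weights):
    """Hypothetical linear encoder: maps a flat frame to a smaller embedding."""
    return [sum(w * p for w, p in zip(row, frame)) for row in weights]

def pixel_loss(predicted_frame, next_frame):
    """Pixel-level objective: compare frames value by value."""
    return mse(predicted_frame, next_frame)

def latent_loss(predicted_frame, next_frame, weights):
    """Latent-space objective: compare the embeddings of the two frames."""
    return mse(encode(predicted_frame, weights), encode(next_frame, weights))

# Assumption: a toy encoder that averages neighbouring pixel pairs.
ENCODER_WEIGHTS = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

next_frame = [1.0, 3.0, 2.0, 4.0]       # true next frame (4 "pixels")
predicted_frame = [2.0, 2.0, 3.0, 3.0]  # right pair averages, wrong fine detail

pixel = pixel_loss(predicted_frame, next_frame)
latent = latent_loss(predicted_frame, next_frame, ENCODER_WEIGHTS)
print(f"pixel loss:  {pixel}")
print(f"latent loss: {latent}")
```

In this toy case the prediction gets the coarse structure right and the fine detail wrong, so the pixel loss is 1.0 while the latent loss is 0.0. The latent objective only penalizes errors in the features the encoder retains.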

Methodology

Datasets: The authors use a collection of synthetic spatiotemporal simulations (e.g., Navier‑Stokes fluid flow, wave propagation, particle dynamics) where ground‑truth physical parameters (viscosity, wave speed, force fields) are known.
Self‑supervised pre‑training: Models are first trained without labels using either:
- Pixel‑level predictive objectives (predict the next frame directly).
- Latent‑space objectives (joint embedding predictive architectures, contrastive learning, masked autoencoding).
Downstream probing: After pre‑training, the encoder is frozen and a lightweight linear probe (or small MLP) is trained on its embeddings to predict the hidden physical parameters. Performance is measured by mean‑squared error for continuous parameters and by classification accuracy for categorical ones.
Baselines: Classic next‑frame prediction networks (e.g., ConvLSTM, video diffusion models) serve as baselines to illustrate the gap between raw prediction quality and representation usefulness.
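
The downstream probing step described above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the authors' code: the "embeddings" are random vectors standing in for frozen encoder outputs, the target parameter is a made-up linear function of them (`true_w`, `true_b`), and the probe is fit with plain gradient descent. The paper's actual probe architecture and optimizer are not specified here.

```python
import random

def predict(weights, bias, z):
    """Linear probe: a dot product with the embedding plus a bias."""
    return sum(w * x for w, x in zip(weights, z)) + bias

def mse(weights, bias, embeddings, targets):
    """Mean squared error of the probe on a set of embeddings."""
    errors = [predict(weights, bias, z) - y for z, y in zip(embeddings, targets)]
    return sum(e * e for e in errors) / len(errors)

def fit_linear_probe(embeddings, targets, lr=0.1, steps=500):
    """Fit the probe by full-batch gradient descent on the MSE.

    Only `weights` and `bias` are updated; the embeddings are treated as
    fixed inputs, which is what "frozen encoder" means in practice.
    """
    n, dim = len(embeddings), len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        errors = [predict(weights, bias, z) - y for z, y in zip(embeddings, targets)]
        grad_w = [2.0 * sum(e * z[j] for e, z in zip(errors, embeddings)) / n
                  for j in range(dim)]
        grad_b = 2.0 * sum(errors) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Assumption: synthetic stand-ins for frozen encoder outputs and for the
# hidden physical parameter (e.g., a viscosity). Not data from the paper.
rng = random.Random(0)
true_w, true_b = [1.5, -2.0, 0.5], 0.3
embeddings = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(250)]
params = [predict(true_w, true_b, z) for z in embeddings]

# Hold out the last 50 samples to evaluate the probe.
train_z, test_z = embeddings[:200], embeddings[200:]
train_y, test_y = params[:200], params[200:]

weights, bias = fit_linear_probe(train_z, train_y)
test_mse = mse(weights, bias, test_z, test_y)
print(f"held-out probe MSE: {test_mse:.2e}")
```

Because the probe has so little capacity, a low held-out error mostly reflects how accessible the parameter is in the embedding rather than what the probe itself learned, which is the rationale for keeping it linear.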

Results & Findings

Generic SSL beats physics‑specific predictors: Methods like SimCLR and masked autoencoders, originally designed for natural images, estimate physical parameters more accurately than dedicated next‑frame predictors.
JEPAs lead the pack: Joint embedding predictive architectures, which learn to map consecutive frames into a shared latent space and predict future embeddings, consistently outperform both generic SSL and pixel‑level models.
Error compounding is less of an issue: Since downstream tasks rely on a single forward pass through the encoder, the autoregressive roll‑out errors that plague frame‑prediction models have minimal impact.
Probe performance tracks physical content: Higher linear probe scores align with embeddings that preserve physically relevant invariants (e.g., conserved quantities), which suggests that probe performance is a reasonable proxy for “physics‑groundedness.”
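
The error-compounding point above can be illustrated with a toy roll-out. This sketch is an assumption-laden example, not an experiment from the paper: the "physics" is a 2D rotation (an idealized oscillator), and the "learned model" is the same rotation with a 5% error in the angle. Both step functions and all values are hypothetical.

```python
import math

def make_rotation_step(angle):
    """Return a one-step model that rotates a 2D state by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)

    def step(state):
        x, y = state
        return (c * x - s * y, s * x + c * y)

    return step

def rollout(step, state, n_steps):
    """Autoregressive roll-out: each prediction becomes the next input."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory

true_step = make_rotation_step(0.100)   # assumed ground-truth dynamics
model_step = make_rotation_step(0.105)  # assumed learned model, 5% angle error

truth = rollout(true_step, (1.0, 0.0), 100)
preds = rollout(model_step, (1.0, 0.0), 100)

# Euclidean distance between predicted and true state at every step.
errors = [math.dist(p, q) for p, q in zip(preds, truth)]

print(f"error after   1 step : {errors[1]:.4f}")
print(f"error after 100 steps: {errors[100]:.4f}")
```

A one-step error of about 0.005 grows to roughly 0.49 after 100 steps because each prediction is fed back in as input. A probe on a frozen encoder makes a single forward pass per sample, so there is no loop in which errors can accumulate this way.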

Practical Implications

Faster scientific pipelines: Researchers can pre‑train a universal encoder on large unlabeled simulation data and then reuse it to quickly estimate hidden parameters for new experiments, cutting down on costly simulation runs.
Model selection for engineering tools: When building ML‑augmented simulators (e.g., for CFD or climate modeling), focusing on latent‑space SSL may yield more robust, interpretable components than striving for pixel‑perfect predictions.
Transfer to real‑world data: Because the evaluated SSL methods are not tied to a specific physics engine, the same encoders could be fine‑tuned on real sensor streams (e.g., satellite imagery, medical imaging) to extract physical descriptors without extensive labeled datasets.
Reduced compute budget: Latent‑space models tend to be lighter than full video prediction networks, making them attractive for edge deployment (e.g., on‑board diagnostics for drones or autonomous vehicles).

Limitations & Future Work

Synthetic focus: All experiments use simulated data; real‑world noise, measurement errors, and partial observability may affect performance.
Limited physics diversity: The study covers a handful of PDE‑based systems; extending to chaotic or multi‑scale phenomena (e.g., turbulence) remains open.
Probe simplicity: Linear probes may underestimate the full potential of the embeddings; exploring deeper fine‑tuning strategies could reveal additional gains.
Interpretability: While the embeddings capture physical parameters, the paper does not provide tools for visualizing or interpreting the learned latent space in domain‑specific terms.

Overall, the work suggests that developers building ML tools for physical simulation should consider self‑supervised latent‑space learning as a more efficient and physically faithful alternative to traditional next‑frame prediction models.

Authors

Helen Qu
Rudy Morel
Michael McCabe
Alberto Bietti
François Lanusse
Shirley Ho
Yann LeCun

Paper Information

arXiv ID: 2603.13227v1
Categories: cs.LG, cs.CV
Published: March 13, 2026