[Paper] A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Published: April 23, 2026 at 01:49 PM EDT
4 min read

Source: arXiv - 2604.21903v1

Overview

This paper introduces a scale‑adaptive framework for jointly increasing the spatial and temporal resolution of video‑like data (e.g., precipitation fields) using diffusion models. By decoupling a deterministic “mean” prediction from a stochastic residual, the same neural architecture can be re‑used across a wide range of up‑scaling factors—something that traditional spatiotemporal super‑resolution (SR) pipelines struggle to do.

Key Contributions

  • Unified architecture that works for any combination of spatial (×1‑×25) and temporal (×1‑×6) up‑scaling factors.
  • Deterministic‑plus‑diffusion decomposition: a conditional‑mean predictor with attention handles the bulk of the signal, while a conditional diffusion model captures the remaining uncertainty.
  • Scale‑adaptive hyper‑parameter recipe (noise schedule β, temporal context length L, optional mass‑conservation transform f) that requires only minimal retuning when changing SR factors.
  • Mass‑conservation transform to preserve total precipitation, preventing unrealistic amplification of extremes at high up‑scaling ratios.
  • Extensive validation on French reanalysis precipitation (Comephore), showing consistent quality across the full factor range.

Methodology

  1. Problem decomposition – The target high‑resolution sequence (Y) is expressed as
    [ Y = \underbrace{\mu_\theta(X)}_{\text{deterministic mean}} + \underbrace{R_\phi(X, \epsilon)}_{\text{diffusion residual}}, ]
    where (X) is the low‑resolution input, (\mu_\theta) is an attention‑based encoder‑decoder that predicts the conditional mean, and (R_\phi) is a conditional diffusion model that adds stochastic detail.
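
    The decomposition can be sketched in a few lines. The two networks below are toy stand‑ins (a ×2 nearest‑neighbour upsampler and a scaled‑noise residual), not the paper's architectures; only the additive structure is taken from the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def super_resolve(x_lr, mean_net, residual_net, n_samples=4):
        """Deterministic-plus-diffusion decomposition: Y = mu_theta(X) + R_phi(X, eps)."""
        mu = mean_net(x_lr)                       # deterministic conditional mean
        samples = []
        for _ in range(n_samples):
            eps = rng.standard_normal(mu.shape)   # fresh noise per ensemble member
            samples.append(mu + residual_net(x_lr, eps))
        return mu, np.stack(samples)

    # Hypothetical stand-ins for the two networks.
    mean_net = lambda x: np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)  # x2 upsample
    residual_net = lambda x, eps: 0.1 * eps                             # toy residual

    mu, ensemble = super_resolve(np.ones((4, 4)), mean_net, residual_net)
    ```

    Because the residual is sampled repeatedly, the same call yields an ensemble of plausible high‑resolution fields around a single deterministic mean.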

  2. Deterministic mean network – A transformer‑style architecture attends over a temporal window of length (L) (e.g., 5 frames for 1‑fps data, scaled up for slower cadences) to capture long‑range spatiotemporal dependencies.
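
    One plausible way to gather that temporal window is a centered slice that clamps at the sequence boundaries; the paper's exact windowing may differ, so treat this as an illustrative sketch.

    ```python
    import numpy as np

    def temporal_window(frames, t, L):
        """Return a window of L frames roughly centered on index t,
        clamped so it never runs off either end of the sequence."""
        half = L // 2
        start = max(0, min(t - half, len(frames) - L))
        return frames[start:start + L]

    frames = np.arange(10)                    # toy 10-frame sequence
    ctx = temporal_window(frames, t=1, L=5)   # clamped at the left boundary
    ```

    Keeping the window length tied to physical time (rather than frame count) is what lets the attention horizon stay constant across cadences.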

  3. Conditional diffusion model – Starting from pure Gaussian noise, the model iteratively denoises conditioned on the low‑resolution input and the deterministic mean. The diffusion noise schedule amplitude (\beta) controls how much variability is injected; larger SR factors use a higher (\beta) to encourage diverse outputs.
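
    The sampling loop has the standard DDPM ancestral form. The sketch below assumes an epsilon‑prediction parameterization and uses a zero‑output placeholder denoiser; the paper's conditioning and parameterization may differ in detail.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_residual(cond, denoise_fn, betas):
        """DDPM-style ancestral sampling for the residual, conditioned on
        `cond` (low-res input + deterministic mean). `denoise_fn(x, cond, t)`
        is assumed to predict the noise eps added at step t."""
        alphas = 1.0 - betas
        alpha_bars = np.cumprod(alphas)
        x = rng.standard_normal(cond.shape)            # start from pure noise
        for t in reversed(range(len(betas))):
            eps_hat = denoise_fn(x, cond, t)
            coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
            x = (x - coef * eps_hat) / np.sqrt(alphas[t])
            if t > 0:                                  # no noise at the final step
                x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        return x

    betas = np.linspace(1e-4, 0.02, 50)                # larger amplitude for larger SR factors
    toy_denoiser = lambda x, cond, t: np.zeros_like(x) # placeholder network
    res = sample_residual(np.zeros((8, 8)), toy_denoiser, betas)
    ```

    Raising the amplitude of `betas` injects more variability into the residual, which is the mechanism the paper leans on at high up‑scaling factors.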

  4. Mass‑conservation transform – An optional post‑processing function (f(\cdot)) rescales the residual so that the spatial sum of precipitation matches that of the input, mitigating the “blow‑up” of extreme values when up‑scaling by large factors.
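
    One plausible form of such a transform rescales each coarse‑cell block of the high‑resolution output so its mean matches the corresponding low‑resolution value; the paper's exact definition of (f) may differ, so this is a sketch of the idea rather than the authors' implementation.

    ```python
    import numpy as np

    def conserve_mass(y_hr, x_lr, factor):
        """Rescale each (factor x factor) block of y_hr so its mean equals
        the matching low-res cell in x_lr, assuming x_lr holds per-cell
        mean precipitation (an assumption, not from the paper)."""
        H, W = x_lr.shape
        y = y_hr.reshape(H, factor, W, factor)
        block_sum = y.sum(axis=(1, 3), keepdims=True)
        target = x_lr[:, None, :, None] * factor**2           # required block sum
        scale = np.where(block_sum > 0, target / block_sum, 0.0)
        return (y * scale).reshape(H * factor, W * factor)

    rng = np.random.default_rng(0)
    x_lr = np.array([[1.0, 2.0], [3.0, 4.0]])
    y_hr = np.abs(rng.standard_normal((4, 4))) + 0.1          # toy positive field
    y_fixed = conserve_mass(y_hr, x_lr, 2)
    ```

    The multiplicative rescaling preserves the spatial pattern within each block while pinning its total, which is what prevents the residual from inflating extremes.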

  5. Scale‑adaptivity – To move from one SR factor to another, only three hyper‑parameters are adjusted:

    • β (larger for higher up‑scaling),
    • L (chosen so the attention horizon in physical time stays roughly constant),
    • f (tuned to limit extreme amplification).
      The network weights are re‑trained with these settings, but the architecture itself does not change.
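
    The recipe above can be written as a small lookup function. The qualitative rules come from the paper; the specific constants and thresholds below are illustrative assumptions, not published values.

    ```python
    def scale_recipe(spatial_factor, temporal_factor, cadence_s, horizon_s=5.0):
        """Map SR factors to the three tunables (beta amplitude, context
        length L, mass-conservation flag). Numbers are hypothetical."""
        beta_max = 0.01 * max(spatial_factor, temporal_factor)  # more noise at larger factors
        L = max(1, round(horizon_s / cadence_s))                # keep physical horizon constant
        use_f = spatial_factor >= 10                            # guard extremes at high ratios
        return beta_max, L, use_f

    beta_max, L, use_f = scale_recipe(spatial_factor=25, temporal_factor=6, cadence_s=1.0)
    ```

    With a mapping like this, switching factors reduces to re‑training under a new triple rather than redesigning the model.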

Results & Findings

  • Quality across scales – Peak Signal‑to‑Noise Ratio (PSNR) and Structural Similarity (SSIM) degrade gracefully as the up‑scaling factor grows, staying competitive with factor‑specific baselines.
  • Uncertainty quantification – The diffusion residual yields realistic ensembles; variance grows with larger SR factors, reflecting the higher under‑determination.
  • Mass‑conservation impact – Applying (f) reduces mean absolute precipitation error by up to 15 % for the most extreme (×25 spatial, ×6 temporal) cases, without harming visual fidelity.
  • Training efficiency – Because the same backbone is reused, the total training time across all factor combinations is roughly 1.3× that of a single‑factor model, a substantial saving compared with training a separate model per factor.

Practical Implications

  • Climate & weather services can now generate high‑resolution forecasts from coarse reanalysis data without maintaining a zoo of specialized models for each resolution or cadence.
  • Video processing pipelines (e.g., surveillance, sports analytics) that need to upscale both frame rate and pixel density can adopt the deterministic‑plus‑diffusion recipe, gaining a single, maintainable codebase.
  • Developers get a clear hyper‑parameter tuning guide (β, L, f) that maps directly to the desired up‑scaling ratio, simplifying deployment in production environments.
  • Uncertainty‑aware decision making – The diffusion component naturally provides ensembles, enabling risk‑aware downstream tasks such as flood prediction or autonomous‑vehicle perception.

Limitations & Future Work

  • The approach still requires re‑training for each new factor set; true zero‑shot scaling (no retraining) remains an open challenge.
  • Experiments are limited to precipitation fields over France; broader geographic and modality validation (e.g., satellite imagery, medical video) is needed.
  • The mass‑conservation transform is handcrafted; learning a physics‑aware constraint jointly with the diffusion model could improve realism further.
  • Scaling to very high‑resolution video (e.g., 8K) may demand more efficient attention mechanisms or hierarchical diffusion steps, which the authors plan to explore.

Authors

  • Max Defez
  • Filippo Quarenghi
  • Mathieu Vrac
  • Stephan Mandt
  • Tom Beucler

Paper Information

  • arXiv ID: 2604.21903v1
  • Categories: cs.LG, cs.AI
  • Published: April 23, 2026