[Paper] A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models
Source: arXiv - 2604.21903v1
Overview
This paper introduces a scale‑adaptive framework for jointly increasing the spatial and temporal resolution of video‑like data (e.g., precipitation fields) using diffusion models. By decoupling a deterministic “mean” prediction from a stochastic residual, the same neural architecture can be re‑used across a wide range of up‑scaling factors—something that traditional spatiotemporal super‑resolution (SR) pipelines struggle to do.
Key Contributions
- Unified architecture that works for any combination of spatial (×1‑×25) and temporal (×1‑×6) up‑scaling factors.
- Deterministic‑plus‑diffusion decomposition: a conditional‑mean predictor with attention handles the bulk of the signal, while a conditional diffusion model captures the remaining uncertainty.
- Scale‑adaptive hyper‑parameter recipe (noise schedule β, temporal context length L, optional mass‑conservation transform f) that requires only minimal retuning when changing SR factors.
- Mass‑conservation transform to preserve total precipitation, preventing unrealistic amplification of extremes at high up‑scaling ratios.
- Extensive validation on French reanalysis precipitation (Comephore), showing consistent quality across the full factor range.
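The mass-conservation idea can be made concrete with a minimal sketch: rescale a super-resolved frame so its spatial precipitation total matches the coarse input's. Everything here (the global rescaling, the cell-average assumption) is an illustrative stand-in, not the paper's exact transform \(f\), which may act per coarse cell or on the residual only.

```python
import numpy as np

def conserve_mass(y_hr, x_lr, factor):
    """Rescale a super-resolved precipitation frame so its spatial total
    matches the coarse input's total. Assumes the coarse field holds
    cell-average rates, so each coarse value stands for factor**2 fine cells.
    (Illustrative global version; the paper's f may operate more locally.)"""
    target = x_lr.sum() * factor ** 2
    current = y_hr.sum()
    if current == 0.0:
        return y_hr
    return y_hr * (target / current)

rng = np.random.default_rng(0)
x_lr = np.full((4, 4), 2.0)              # coarse 4x4 field, 2 mm/h everywhere
y_hr = rng.uniform(0.0, 5.0, (16, 16))   # candidate x4 super-resolved field
y_fix = conserve_mass(y_hr, x_lr, factor=4)
```

Because the correction is a single multiplicative factor, it preserves the spatial pattern of the super-resolved field while pinning its total, which is exactly why it damps unrealistic amplification of extremes.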
Methodology
- Problem decomposition – The target high-resolution sequence \(Y\) is expressed as
  \[ Y = \underbrace{\mu_\theta(X)}_{\text{deterministic mean}} + \underbrace{R_\phi(X, \epsilon)}_{\text{diffusion residual}}, \]
  where \(X\) is the low-resolution input, \(\mu_\theta\) is an attention-based encoder-decoder that predicts the conditional mean, and \(R_\phi\) is a conditional diffusion model that adds stochastic detail.
- Deterministic mean network – A transformer-style architecture attends over a temporal window of length \(L\) (e.g., 5 frames for 1-fps data, scaled up for slower cadences) to capture long-range spatiotemporal dependencies.
- Conditional diffusion model – Starting from pure Gaussian noise, the model iteratively denoises conditioned on the low-resolution input and the deterministic mean. The noise-schedule amplitude \(\beta\) controls how much variability is injected; larger SR factors use a higher \(\beta\) to encourage diverse outputs.
- Mass-conservation transform – An optional post-processing function \(f(\cdot)\) rescales the residual so that the spatial sum of precipitation matches that of the input, mitigating the blow-up of extreme values when up-scaling by large factors.
- Scale-adaptivity – To move from one SR factor to another, only three hyper-parameters are adjusted:
  - \(\beta\) (larger for higher up-scaling),
  - \(L\) (chosen so the attention horizon in physical time stays roughly constant),
  - \(f\) (tuned to limit extreme amplification).
  The network weights are re-trained with these settings, but the architecture itself does not change.
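The decomposition above can be illustrated with a toy sampler. Everything in this sketch is a placeholder for the paper's trained components: nearest-neighbour upsampling stands in for the attention-based mean network, the "denoiser" is an untrained lambda, and the linear \(\beta\) schedule merely shows where the scale-adaptive amplitude enters.

```python
import numpy as np

def linear_beta_schedule(T, beta_max):
    """Linear noise schedule; the paper raises the amplitude (beta_max)
    for larger super-resolution factors to inject more variability."""
    return np.linspace(1e-4, beta_max, T)

def mean_predictor(x_lr, factor):
    """Stand-in for the attention-based conditional-mean network mu_theta:
    here just nearest-neighbour upsampling of the coarse input."""
    return np.kron(x_lr, np.ones((factor, factor)))

def sample_residual(shape, betas, denoise_fn, rng):
    """Toy DDPM-style ancestral sampler for the stochastic residual R_phi."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    r = rng.standard_normal(shape)              # start from pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = denoise_fn(r, t)              # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        r = (r - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                               # re-inject noise except at t=0
            r += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return r

rng = np.random.default_rng(0)
x_lr = rng.uniform(0.0, 3.0, (4, 4))            # coarse precipitation frame
betas = linear_beta_schedule(T=50, beta_max=0.05)
mu = mean_predictor(x_lr, factor=4)             # deterministic mean, 16x16
residual = sample_residual(mu.shape, betas, lambda r, t: r, rng)
y_hr = mu + 0.1 * residual                      # Y = mu_theta(X) + R_phi(X, eps)
```

Running the sampler several times with different seeds yields an ensemble of plausible high-resolution fields around the same deterministic mean, which is the mechanism behind the paper's uncertainty quantification.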
Results & Findings
- Quality across scales – Peak Signal‑to‑Noise Ratio (PSNR) and Structural Similarity (SSIM) degrade gracefully as the up‑scaling factor grows, staying competitive with factor‑specific baselines.
- Uncertainty quantification – The diffusion residual yields realistic ensembles; variance grows with larger SR factors, reflecting the higher under‑determination.
- Mass‑conservation impact – Applying \(f\) reduces mean absolute precipitation error by up to 15 % for the most extreme (×25 spatial, ×6 temporal) cases, without harming visual fidelity.
- Training efficiency – Because the same backbone is reused, the total training time across all factor combinations is roughly 1.3× that of a single‑factor model, a substantial saving compared with training a separate model per factor.
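PSNR, one of the metrics above, has a simple closed form worth stating: \(10 \log_{10}(\text{MAX}^2 / \text{MSE})\). A minimal sketch (the `data_range` value is an assumption; the paper's evaluation details are not specified here):

```python
import numpy as np

def psnr(ref, pred, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref - pred) ** 2)
    if mse == 0.0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((8, 8))
pred = np.full((8, 8), 0.1)                 # constant 0.1 error vs. a zero field
score = psnr(ref, pred, data_range=1.0)     # MSE = 0.01 -> 20 dB
```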
Practical Implications
- Climate & weather services can now generate high‑resolution forecasts from coarse reanalysis data without maintaining a zoo of specialized models for each resolution or cadence.
- Video processing pipelines (e.g., surveillance, sports analytics) that need to upscale both frame rate and pixel density can adopt the deterministic‑plus‑diffusion recipe, gaining a single, maintainable codebase.
- Developers get a clear hyper‑parameter tuning guide (β, L, f) that maps directly to the desired up‑scaling ratio, simplifying deployment in production environments.
- Uncertainty‑aware decision making – The diffusion component naturally provides ensembles, enabling risk‑aware downstream tasks such as flood prediction or autonomous‑vehicle perception.
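The ensemble-based risk use mentioned above reduces to simple statistics over diffusion samples. A sketch with synthetic ensemble members (the gamma-distributed fields and the 5 mm/h threshold are illustrative assumptions, not values from the paper):

```python
import numpy as np

def exceedance_probability(ensemble, threshold):
    """Per-pixel probability that precipitation exceeds a threshold,
    estimated from an ensemble of diffusion samples (axis 0 = member)."""
    return (ensemble > threshold).mean(axis=0)

rng = np.random.default_rng(1)
members = rng.gamma(shape=2.0, scale=1.0, size=(100, 8, 8))  # 100 synthetic samples
p_heavy = exceedance_probability(members, threshold=5.0)     # 8x8 probability map
```

A downstream system (e.g., a flood-warning rule) can then trigger wherever `p_heavy` exceeds a chosen risk tolerance, which is exactly the kind of decision a single deterministic output cannot support.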
Limitations & Future Work
- The approach still requires re‑training for each new factor set; true zero‑shot scaling (no retraining) remains an open challenge.
- Experiments are limited to precipitation fields over France; broader geographic and modality validation (e.g., satellite imagery, medical video) is needed.
- The mass‑conservation transform is handcrafted; learning a physics‑aware constraint jointly with the diffusion model could improve realism further.
- Scaling to very high‑resolution video (e.g., 8K) may demand more efficient attention mechanisms or hierarchical diffusion steps, which the authors plan to explore.
Authors
- Max Defez
- Filippo Quarenghi
- Mathieu Vrac
- Stephan Mandt
- Tom Beucler
Paper Information
- arXiv ID: 2604.21903v1
- Categories: cs.LG, cs.AI
- Published: April 23, 2026