[Paper] A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Published: April 23, 2026 at 01:49 PM EDT
4 min read

Source: arXiv - 2604.21903v1

Overview

This paper introduces a scale‑adaptive framework for jointly increasing the spatial and temporal resolution of video‑like data (e.g., precipitation fields) using diffusion models. By decoupling a deterministic “mean” prediction from a stochastic residual, the same neural architecture can be re‑used across a wide range of up‑scaling factors—something that traditional spatiotemporal super‑resolution (SR) pipelines struggle to do.

Key Contributions

  • Unified architecture that works for any combination of spatial (×1‑×25) and temporal (×1‑×6) up‑scaling factors.
  • Deterministic‑plus‑diffusion decomposition: a conditional‑mean predictor with attention handles the bulk of the signal, while a conditional diffusion model captures the remaining uncertainty.
  • Scale‑adaptive hyper‑parameter recipe (noise schedule β, temporal context length L, optional mass‑conservation transform f) that requires only minimal retuning when changing SR factors.
  • Mass‑conservation transform to preserve total precipitation, preventing unrealistic amplification of extremes at high up‑scaling ratios.
  • Extensive validation on French reanalysis precipitation (Comephore), showing consistent quality across the full factor range.

Methodology

  1. Problem decomposition – The target high‑resolution sequence (Y) is expressed as
    [ Y = \underbrace{\mu_\theta(X)}_{\text{deterministic mean}} + \underbrace{R_\phi(X, \epsilon)}_{\text{diffusion residual}}, ]
    where (X) is the low‑resolution input, (\mu_\theta) is an attention‑based encoder‑decoder that predicts the conditional mean, and (R_\phi) is a conditional diffusion model that adds stochastic detail.
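
    The decomposition can be sketched in a few lines. The two networks below are toy stand‑ins (a ×2 nearest‑neighbour upsampler and a scaled‑noise residual), not the paper's architectures; only the additive structure is taken from the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def super_resolve(x_lr, mean_net, residual_net, n_samples=4):
        """Deterministic-plus-diffusion decomposition: Y = mu_theta(X) + R_phi(X, eps)."""
        mu = mean_net(x_lr)                       # deterministic conditional mean
        samples = []
        for _ in range(n_samples):
            eps = rng.standard_normal(mu.shape)   # fresh noise per ensemble member
            samples.append(mu + residual_net(x_lr, eps))
        return mu, np.stack(samples)

    # Hypothetical stand-ins for the two networks.
    mean_net = lambda x: np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)  # x2 upsample
    residual_net = lambda x, eps: 0.1 * eps                             # toy residual

    mu, ensemble = super_resolve(np.ones((4, 4)), mean_net, residual_net)
    ```

    Because the residual is sampled repeatedly, the same call yields an ensemble of plausible high‑resolution fields around a single deterministic mean.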

  2. Deterministic mean network – A transformer‑style architecture attends over a temporal window of length (L) (e.g., 5 frames for 1‑fps data, scaled up for slower cadences) to capture long‑range spatiotemporal dependencies.
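
    One plausible way to gather that temporal window is a centered slice that clamps at the sequence boundaries; the paper's exact windowing may differ, so treat this as an illustrative sketch.

    ```python
    import numpy as np

    def temporal_window(frames, t, L):
        """Return a window of L frames roughly centered on index t,
        clamped so it never runs off either end of the sequence."""
        half = L // 2
        start = max(0, min(t - half, len(frames) - L))
        return frames[start:start + L]

    frames = np.arange(10)                    # toy 10-frame sequence
    ctx = temporal_window(frames, t=1, L=5)   # clamped at the left boundary
    ```

    Keeping the window length tied to physical time (rather than frame count) is what lets the attention horizon stay constant across cadences.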

  3. Conditional diffusion model – Starting from pure Gaussian noise, the model iteratively denoises conditioned on the low‑resolution input and the deterministic mean. The diffusion noise schedule amplitude (\beta) controls how much variability is injected; larger SR factors use a higher (\beta) to encourage diverse outputs.
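
    The sampling loop has the standard DDPM ancestral form. The sketch below assumes an epsilon‑prediction parameterization and uses a zero‑output placeholder denoiser; the paper's conditioning and parameterization may differ in detail.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_residual(cond, denoise_fn, betas):
        """DDPM-style ancestral sampling for the residual, conditioned on
        `cond` (low-res input + deterministic mean). `denoise_fn(x, cond, t)`
        is assumed to predict the noise eps added at step t."""
        alphas = 1.0 - betas
        alpha_bars = np.cumprod(alphas)
        x = rng.standard_normal(cond.shape)            # start from pure noise
        for t in reversed(range(len(betas))):
            eps_hat = denoise_fn(x, cond, t)
            coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
            x = (x - coef * eps_hat) / np.sqrt(alphas[t])
            if t > 0:                                  # no noise at the final step
                x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        return x

    betas = np.linspace(1e-4, 0.02, 50)                # larger amplitude for larger SR factors
    toy_denoiser = lambda x, cond, t: np.zeros_like(x) # placeholder network
    res = sample_residual(np.zeros((8, 8)), toy_denoiser, betas)
    ```

    Raising the amplitude of `betas` injects more variability into the residual, which is the mechanism the paper leans on at high up‑scaling factors.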

  4. Mass‑conservation transform – An optional post‑processing function (f(\cdot)) rescales the residual so that the spatial sum of precipitation matches that of the input, mitigating the “blow‑up” of extreme values when up‑scaling by large factors.
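
    One plausible form of such a transform rescales each coarse‑cell block of the high‑resolution output so its mean matches the corresponding low‑resolution value; the paper's exact definition of (f) may differ, so this is a sketch of the idea rather than the authors' implementation.

    ```python
    import numpy as np

    def conserve_mass(y_hr, x_lr, factor):
        """Rescale each (factor x factor) block of y_hr so its mean equals
        the matching low-res cell in x_lr, assuming x_lr holds per-cell
        mean precipitation (an assumption, not from the paper)."""
        H, W = x_lr.shape
        y = y_hr.reshape(H, factor, W, factor)
        block_sum = y.sum(axis=(1, 3), keepdims=True)
        target = x_lr[:, None, :, None] * factor**2           # required block sum
        scale = np.where(block_sum > 0, target / block_sum, 0.0)
        return (y * scale).reshape(H * factor, W * factor)

    rng = np.random.default_rng(0)
    x_lr = np.array([[1.0, 2.0], [3.0, 4.0]])
    y_hr = np.abs(rng.standard_normal((4, 4))) + 0.1          # toy positive field
    y_fixed = conserve_mass(y_hr, x_lr, 2)
    ```

    The multiplicative rescaling preserves the spatial pattern within each block while pinning its total, which is what prevents the residual from inflating extremes.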

  5. Scale‑adaptivity – To move from one SR factor to another, only three hyper‑parameters are adjusted:

    • β (larger for higher up‑scaling),
    • L (chosen so the attention horizon in physical time stays roughly constant),
    • f (tuned to limit extreme amplification).
      The network weights are re‑trained with these settings, but the architecture itself does not change.
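
    The recipe above can be written as a small lookup function. The qualitative rules come from the paper; the specific constants and thresholds below are illustrative assumptions, not published values.

    ```python
    def scale_recipe(spatial_factor, temporal_factor, cadence_s, horizon_s=5.0):
        """Map SR factors to the three tunables (beta amplitude, context
        length L, mass-conservation flag). Numbers are hypothetical."""
        beta_max = 0.01 * max(spatial_factor, temporal_factor)  # more noise at larger factors
        L = max(1, round(horizon_s / cadence_s))                # keep physical horizon constant
        use_f = spatial_factor >= 10                            # guard extremes at high ratios
        return beta_max, L, use_f

    beta_max, L, use_f = scale_recipe(spatial_factor=25, temporal_factor=6, cadence_s=1.0)
    ```

    With a mapping like this, switching factors reduces to re‑training under a new triple rather than redesigning the model.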

Results & Findings

  • Quality across scales – Peak Signal‑to‑Noise Ratio (PSNR) and Structural Similarity (SSIM) degrade gracefully as the up‑scaling factor grows, staying competitive with factor‑specific baselines.
  • Uncertainty quantification – The diffusion residual yields realistic ensembles; variance grows with larger SR factors, reflecting the higher under‑determination.
  • Mass‑conservation impact – Applying (f) reduces mean absolute precipitation error by up to 15 % for the most extreme (×25 spatial, ×6 temporal) cases, without harming visual fidelity.
  • Training efficiency – Because the same backbone is reused, the total training time across all factor combinations is roughly 1.3× that of a single‑factor model, a substantial saving compared with training a separate model per factor.

Practical Implications

  • Climate & weather services can now generate high‑resolution forecasts from coarse reanalysis data without maintaining a zoo of specialized models for each resolution or cadence.
  • Video processing pipelines (e.g., surveillance, sports analytics) that need to upscale both frame rate and pixel density can adopt the deterministic‑plus‑diffusion recipe, gaining a single, maintainable codebase.
  • Developers get a clear hyper‑parameter tuning guide (β, L, f) that maps directly to the desired up‑scaling ratio, simplifying deployment in production environments.
  • Uncertainty‑aware decision making – The diffusion component naturally provides ensembles, enabling risk‑aware downstream tasks such as flood prediction or autonomous‑vehicle perception.

Limitations & Future Work

  • The approach still requires re‑training for each new factor set; true zero‑shot scaling (no retraining) remains an open challenge.
  • Experiments are limited to precipitation fields over France; broader geographic and modality validation (e.g., satellite imagery, medical video) is needed.
  • The mass‑conservation transform is handcrafted; learning a physics‑aware constraint jointly with the diffusion model could improve realism further.
  • Scaling to very high‑resolution video (e.g., 8K) may demand more efficient attention mechanisms or hierarchical diffusion steps, which the authors plan to explore.

Authors

  • Max Defez
  • Filippo Quarenghi
  • Mathieu Vrac
  • Stephan Mandt
  • Tom Beucler

Paper Information

  • arXiv ID: 2604.21903v1
  • Categories: cs.LG, cs.AI
  • Published: April 23, 2026