[Paper] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Published: December 29, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.23709v1

Overview

The paper introduces Stream‑DiffVSR, a diffusion‑based video‑super‑resolution (VSR) system that works online – it only looks at past frames and can upscale 720p video in roughly 0.33 s per frame on an RTX 4090. By redesigning the diffusion pipeline for causality and speed, the authors bridge the gap between the high perceptual quality of diffusion models and the low‑latency demands of real‑time applications such as streaming, video conferencing, and AR/VR.

Key Contributions

  • Causal diffusion framework: Guarantees that only previously received frames are used, enabling true streaming VSR (a minimal interface sketch follows this list).
  • Four‑step distilled denoiser: Compresses the usual dozens of denoising steps into just four inference steps, reducing latency by more than 130× compared to prior diffusion-based VSR.
  • Auto‑regressive Temporal Guidance (ARTG): Aligns motion information from past frames directly into the latent denoising stage, preserving temporal consistency without expensive optical‑flow post‑processing.
  • Temporal‑aware decoder with Temporal Processor Module (TPM): A lightweight head that refines spatial details while enforcing temporal coherence across frames.
  • State‑of‑the‑art performance: Beats the current online VSR leader (TMP) on perceptual quality (LPIPS improved by 0.095) while being dramatically faster, and sets the lowest reported latency for diffusion‑based VSR (0.328 s vs. >4600 s initial delay).
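
The causal framework is what makes the system streamable: at time t it may use the current low-resolution frame and only previously produced high-resolution outputs. Below is a minimal sketch of that interface, assuming a hypothetical upscale_frame model call and a three-frame history window (both illustrative, not the authors' implementation):

```python
from collections import deque

def stream_vsr(lr_frames, upscale_frame, history_len=3):
    """Upscale frames one at a time using only already-produced outputs."""
    history = deque(maxlen=history_len)        # past HR outputs only; no look-ahead
    for lr in lr_frames:                       # frames arrive in streaming order
        hr = upscale_frame(lr, list(history))  # hypothetical stand-in for the model call
        history.append(hr)
        yield hr                               # emitted immediately, without buffering future frames
```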

Methodology

  1. Causal Conditioning – The model receives a sliding window of already‑generated high‑resolution frames and the current low‑resolution input. No future frames are accessed, which is essential for streaming.
  2. Distilled Diffusion – Traditional diffusion needs 20‑100 denoising steps. The authors train a knowledge‑distilled denoiser that approximates the full diffusion trajectory in just four steps, in the spirit of few‑step distillation methods for image diffusion.
  3. Auto‑regressive Temporal Guidance (ARTG) – Before each denoising step, the latent representation is nudged with motion‑aligned features extracted from the previous high‑resolution outputs. This is done by warping past features using a lightweight motion estimator and injecting them as conditioning vectors.
  4. Temporal Processor Module (TPM) – After the final denoising step, a compact decoder upsamples the latent to the target resolution. TPM incorporates a temporal attention block that looks at a short history (e.g., the last 3 frames) to smooth flicker and reinforce fine details (a toy sketch of one full streaming step follows this list).
  5. Training – The whole pipeline is trained end‑to‑end on high‑frame‑rate video datasets with a perceptual loss (LPIPS), a reconstruction loss (L1), and a temporal consistency loss that discourages flicker between consecutive output frames (a toy loss sketch also follows below).
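
Putting steps 1-4 together, one streaming update can be pictured as below. This is a toy sketch under assumed module names (encoder, motion_net, denoiser, tpm_decoder) and an assumed four-value timestep schedule; it mirrors the description above rather than the authors' actual code:

```python
import torch

@torch.no_grad()
def stream_step(lr_frame, past_feats, encoder, motion_net, denoiser, tpm_decoder,
                timesteps=(750, 500, 250, 0)):        # assumed four-step schedule
    cond = encoder(lr_frame)                           # step 1: condition only on the current LR frame
    guidance = motion_net(cond, past_feats)            # step 3: ARTG, motion-aligned features from past outputs
    z = torch.randn_like(cond)                         # start each frame from latent noise
    for t in timesteps:                                # step 2: distilled four-step denoising trajectory
        z = denoiser(z, t, cond, guidance)             # latent nudged by temporal guidance at every step
    hr_frame, new_feats = tpm_decoder(z, past_feats)   # step 4: TPM decoder with short-history temporal attention
    return hr_frame, new_feats                         # new_feats feed ARTG at the next frame
```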

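The training objective in step 5 can be read as a weighted sum of three terms. A minimal sketch, assuming an external LPIPS callable (e.g., from the lpips package) and made-up weights; the exact form of the temporal term is an assumption:

```python
import torch.nn.functional as F

def total_loss(pred, gt, prev_pred, prev_gt, lpips_fn, w_rec=1.0, w_perc=1.0, w_temp=0.5):
    rec = F.l1_loss(pred, gt)                          # reconstruction loss (L1)
    perc = lpips_fn(pred, gt).mean()                   # perceptual loss (LPIPS), external callable
    # Temporal consistency: penalize frame-to-frame changes that the ground truth does not show.
    temp = F.l1_loss(pred - prev_pred, gt - prev_gt)
    return w_rec * rec + w_perc * perc + w_temp * temp
```
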
Results & Findings

Metric                          TMP (online SOTA)   Stream-DiffVSR
LPIPS (lower is better)         0.215               0.120 (0.095 better)
PSNR in dB (higher is better)   27.8                28.3
Runtime per 720p frame          43 s (GPU)           0.328 s
Initial latency (first frame)   >4600 s              0.328 s
  • Perceptual quality: The LPIPS improvement translates into noticeably sharper textures and fewer artifacts, especially in high‑frequency regions such as hair or foliage.
  • Temporal coherence: Visual inspection and the temporal consistency loss indicate far fewer flickering artifacts compared with naive frame‑by‑frame diffusion.
  • Speed: The four‑step distilled denoiser plus ARTG/TPM reduces the inference cost to a level comparable with traditional CNN‑based VSR, while still delivering diffusion‑level detail.
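
As a quick consistency check on these numbers, 43 s / 0.328 s ≈ 131×, in line with the >130× speedup quoted in the key contributions.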

Practical Implications

  • Live streaming & video conferencing – Platforms can upscale low‑resolution streams on‑the‑fly without buffering future frames, delivering clearer video for bandwidth‑constrained users.
  • Edge‑AI devices – The lightweight decoder and limited diffusion steps make it feasible to run on high‑end consumer GPUs or even optimized on‑device accelerators (e.g., NVIDIA Jetson).
  • AR/VR content pipelines – Real‑time upscaling of 720p (or even 1080p) textures can improve visual fidelity in mixed‑reality applications where latency is a hard constraint.
  • Content creation tools – Editors can preview high‑quality upscaled footage instantly, accelerating workflows for VFX and post‑production.

Limitations & Future Work

  • Hardware dependence – The reported 0.328 s per frame is achieved on an RTX 4090; performance on more modest GPUs or CPUs will be slower, so further model compression may be needed for broader deployment.
  • Temporal window size – ARTG and TPM rely on a short history (typically 3‑5 frames). Extremely fast motion or long‑range dependencies could still cause occasional temporal artifacts.
  • Training data bias – The model is trained on publicly available video datasets; domain‑specific content (e.g., medical imaging, scientific visualization) may require fine‑tuning.
  • Future directions suggested by the authors include:
    • Extending the causal diffusion idea to higher resolutions (4K) with hierarchical upscaling.
    • Exploring adaptive step scheduling where easier frames use fewer diffusion steps.
    • Integrating learned motion estimation that shares parameters with the ARTG module to reduce overhead.

Stream‑DiffVSR demonstrates that diffusion models are no longer confined to offline, batch‑processed video enhancement. By marrying causality, knowledge distillation, and clever temporal guidance, it opens the door for high‑quality, low‑latency VSR in real‑world applications.

Authors

  • Hau-Shiang Shiu
  • Chin-Yang Lin
  • Zhixiang Wang
  • Chi-Wei Hsiao
  • Po-Fan Yu
  • Yu-Chih Chen
  • Yu-Lun Liu

Paper Information

  • arXiv ID: 2512.23709v1
  • Categories: cs.CV
  • Published: December 29, 2025