[Paper] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
Source: arXiv - 2512.21338v1
Overview
HiStream tackles the long‑standing bottleneck of generating high‑resolution video with diffusion models. By systematically cutting out redundant computation across space, time, and denoising steps, the authors deliver a framework that can produce 1080p video at a fraction of the cost of existing methods—making truly high‑fidelity video synthesis practical for developers and media pipelines.
Key Contributions
- Spatial compression pipeline – Denoises a low‑resolution version first, then upsamples while re‑using cached high‑level features to avoid recomputing the full‑resolution diffusion for every frame.
- Temporal chunking with anchor cache – Processes video in fixed‑size chunks, keeping a small “anchor” cache that stabilizes generation across chunks and yields constant inference speed regardless of video length.
- Timestep compression for later chunks – Reduces the number of diffusion steps for chunks that are already conditioned on cached information, cutting compute without noticeable quality loss.
- Two model variants – HiStream (spatial + temporal optimizations) achieves up to 76× speed‑up with state‑of‑the‑art visual quality; HiStream+ (adds timestep compression) pushes the speed‑up to 107× with a modest trade‑off in fidelity.
- Extensive 1080p benchmark – Demonstrates superior perceptual quality (measured by FVD, LPIPS, and user studies) compared to the strong Wan2.1 baseline while dramatically reducing runtime.
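The timestep-compression idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the step counts (50 full, 20 compressed) and the linear schedule are assumptions chosen for clarity.

```python
def make_schedule(num_steps: int, t_max: float = 1.0) -> list[float]:
    """Evenly spaced diffusion timesteps from t_max down toward 0."""
    return [t_max * (1 - i / num_steps) for i in range(num_steps)]

def schedule_for_chunk(chunk_index: int,
                       full_steps: int = 50,
                       compressed_steps: int = 20) -> list[float]:
    """First chunk gets the full denoising schedule; later chunks,
    already conditioned on the anchor cache, get a compressed one."""
    steps = full_steps if chunk_index == 0 else compressed_steps
    return make_schedule(steps)
```

The key design point is that compression applies only to chunks that inherit context from the cache, which is why the first chunk keeps the full schedule.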
Methodology
HiStream reframes high‑resolution video diffusion as an autoregressive streaming problem:
- Low‑resolution denoising – The model first runs a standard diffusion process on a down‑scaled video (e.g., 240p). This cheap pass captures the overall motion and coarse appearance.
- Feature caching – Intermediate latent features from the low‑res pass are stored. When the high‑resolution upsampling stage runs, it conditions on these cached features, so the expensive high‑res diffusion only needs to refine details rather than start from scratch.
- Chunk‑by‑chunk temporal processing – The video is split into overlapping chunks (e.g., 8 frames). An “anchor” frame (or a few frames) is kept in a fixed‑size cache and reused across neighboring chunks, ensuring temporal consistency while keeping memory bounded.
- Reduced timesteps for later chunks – Because later chunks already inherit context from the anchor cache, the diffusion schedule can be shortened (fewer denoising steps), further slashing compute.
The spatial, temporal, and timestep optimizations are orthogonal and can be combined, which is why HiStream+ stacks all three for maximal speed.
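The four steps above can be sketched end-to-end as a toy pipeline. This is a minimal illustration under stated assumptions: `denoise` is a stand-in for a real diffusion denoiser, the 2× upsampling, chunk size of 8, anchor size of 2, and step counts are all hypothetical choices for clarity, not values from the paper.

```python
import numpy as np

CHUNK, ANCHOR = 8, 2  # frames per chunk, anchor frames kept in the cache

def denoise(latents, steps, cond=None):
    """Stand-in for a diffusion denoiser; a real model would iterate
    `steps` times, optionally conditioned on cached anchor features."""
    return latents * 0.0  # pretend we reached the clean latent

def generate(video_lowres: np.ndarray) -> np.ndarray:
    """video_lowres: (T, H, W) noisy low-res latents -> refined high-res."""
    T = video_lowres.shape[0]
    coarse = denoise(video_lowres, steps=50)       # 1) cheap low-res pass
    cache = None                                   # fixed-size anchor cache
    out = []
    for start in range(0, T, CHUNK):               # 3) chunk-by-chunk streaming
        # 2x nearest-neighbor upsample of the cached low-res result
        chunk = np.repeat(np.repeat(coarse[start:start + CHUNK], 2, 1), 2, 2)
        steps = 50 if cache is None else 20        # 4) timestep compression
        refined = denoise(chunk, steps=steps, cond=cache)  # condition on cache
        cache = refined[-ANCHOR:]                  # keep only ANCHOR frames
        out.append(refined)
    return np.concatenate(out, axis=0)
```

Because the cache holds a constant number of anchor frames, per-chunk cost does not grow with clip length, matching the constant inference speed claimed above.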
Results & Findings
| Model | Resolution | FVD ↓ | LPIPS ↓ | Speedup vs. Wan2.1 |
|---|---|---|---|---|
| Wan2.1 (baseline) | 1080p | 210 | 0.31 | 1× |
| HiStream (spatial + temporal) | 1080p | 188 | 0.28 | ≈ 76× |
| HiStream+ (+ timestep compression) | 1080p | 200 | 0.30 | ≈ 107× |
- Visual quality: User studies showed >85 % preference for HiStream over the baseline, despite the massive speed gain.
- Scalability: Inference time stays roughly constant as video length grows, thanks to the fixed‑size anchor cache.
- Memory footprint: The caching strategy reduces GPU memory usage by ~40 % compared with naïve full‑resolution diffusion.
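The bounded-memory behavior follows directly from the cache design. A minimal sketch, assuming frames can be arbitrary objects and using a `deque` with `maxlen` as the eviction mechanism (the class name and cache size of 2 are illustrative, not from the paper):

```python
from collections import deque

class AnchorCache:
    """Fixed-capacity store of the most recent anchor frames."""

    def __init__(self, max_frames: int = 2):
        # deque with maxlen evicts the oldest entry automatically
        self._frames = deque(maxlen=max_frames)

    def push(self, frame) -> None:
        self._frames.append(frame)

    def anchors(self) -> list:
        return list(self._frames)

cache = AnchorCache(max_frames=2)
for i in range(100):          # stream 100 chunks through the cache
    cache.push(f"frame-{i}")
print(cache.anchors())        # only the 2 most recent anchors survive
```

However long the stream runs, the cache never holds more than `max_frames` entries, which is what keeps both memory and per-chunk conditioning cost constant.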
Practical Implications
- Content creation pipelines – Studios and indie developers can now generate 1080p (or higher) video on a single GPU in minutes rather than hours, opening doors for rapid prototyping, AI‑assisted VFX, and on‑the‑fly video synthesis in games.
- Real‑time or near‑real‑time applications – The streaming nature of HiStream makes it suitable for interactive tools (e.g., AI‑driven video editors, live‑stream overlays) where latency is critical.
- Edge deployment – Because the heavy diffusion work is done at low resolution and the high‑res refinement reuses cached features, the approach can be split across devices (e.g., low‑res on a server, high‑res upsampling on a local workstation).
- Cost reduction – The 70‑100× speedup translates directly into lower cloud GPU bills, making large‑scale video generation economically viable for SaaS platforms.
Limitations & Future Work
- Quality trade‑off in HiStream+ – The additional timestep compression causes a small but measurable drop in perceptual metrics; fine‑tuning the schedule per domain may be required.
- Cache size vs. temporal fidelity – A fixed anchor cache works well for moderate motion but may struggle with very fast or highly dynamic scenes; adaptive cache sizing could improve robustness.
- Generalization to ultra‑high resolutions (4K/8K) – The authors note that the current spatial compression pipeline still incurs memory spikes at extreme resolutions, suggesting the need for hierarchical or multi‑scale diffusion strategies.
- Broader modality testing – Experiments focus on natural video; extending to animation, medical imaging, or synthetic data streams remains an open avenue.
Overall, HiStream marks a significant step toward making high‑resolution video diffusion practical for developers and industry, while leaving clear pathways for further refinement and broader adoption.
Authors
- Haonan Qiu
- Shikun Liu
- Zijian Zhou
- Zhaochong An
- Weiming Ren
- Zhiheng Liu
- Jonas Schult
- Sen He
- Shoufa Chen
- Yuren Cong
- Tao Xiang
- Ziwei Liu
- Juan‑Manuel Perez‑Rua
Paper Information
- arXiv ID: 2512.21338v1
- Categories: cs.CV
- Published: December 24, 2025