[Paper] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
Source: arXiv - 2512.21338v1
Overview
HiStream tackles the long‑standing bottleneck of generating high‑resolution video with diffusion models. By systematically cutting out redundant computation across space, time, and denoising steps, the authors deliver a framework that can produce 1080p video at a fraction of the cost of existing methods—making truly high‑fidelity video synthesis practical for developers and media pipelines.
Key Contributions
- Spatial compression pipeline – Denoises a low‑resolution version first, then upsamples while re‑using cached high‑level features to avoid recomputing the full‑resolution diffusion for every frame.
- Temporal chunking with anchor cache – Processes video in fixed‑size chunks, keeping a small “anchor” cache that stabilizes generation across chunks and yields constant inference speed regardless of video length.
- Timestep compression for later chunks – Reduces the number of diffusion steps for chunks that are already conditioned on cached information, cutting compute without noticeable quality loss.
- Two model variants – HiStream (spatial + temporal optimizations) achieves up to 76× speed‑up with state‑of‑the‑art visual quality; HiStream+ (adds timestep compression) pushes the speed‑up to 107× with a modest trade‑off in fidelity.
- Extensive 1080p benchmark – Demonstrates superior perceptual quality (measured by FVD, LPIPS, and user studies) compared to the strong Wan2.1 baseline while dramatically reducing runtime.
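The timestep-compression idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the step counts (50 full, 20 compressed) and the linear schedule are assumptions chosen for clarity.

```python
def make_schedule(num_steps: int, t_max: float = 1.0) -> list[float]:
    """Evenly spaced diffusion timesteps from t_max down toward 0."""
    return [t_max * (1 - i / num_steps) for i in range(num_steps)]

def schedule_for_chunk(chunk_index: int,
                       full_steps: int = 50,
                       compressed_steps: int = 20) -> list[float]:
    """First chunk gets the full denoising schedule; later chunks,
    already conditioned on the anchor cache, get a compressed one."""
    steps = full_steps if chunk_index == 0 else compressed_steps
    return make_schedule(steps)
```

The key design point is that compression applies only to chunks that inherit context from the cache, which is why the first chunk keeps the full schedule.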
Methodology
HiStream reframes high‑resolution video diffusion as an autoregressive streaming problem:
- Low‑resolution denoising – The model first runs a standard diffusion process on a down‑scaled video (e.g., 240p). This cheap pass captures the overall motion and coarse appearance.
- Feature caching – Intermediate latent features from the low‑res pass are stored. When the high‑resolution upsampling stage runs, it conditions on these cached features, so the expensive high‑res diffusion only needs to refine details rather than start from scratch.
- Chunk‑by‑chunk temporal processing – The video is split into overlapping chunks (e.g., 8 frames). An “anchor” frame (or a few frames) is kept in a fixed‑size cache and reused across neighboring chunks, ensuring temporal consistency while keeping memory bounded.
- Reduced timesteps for later chunks – Because later chunks already inherit context from the anchor cache, the diffusion schedule can be shortened (fewer denoising steps), further slashing compute.
The spatial, temporal, and timestep optimizations are orthogonal and can be combined, which is why HiStream+ stacks all three for maximal speed.
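The four steps above can be sketched end-to-end as a toy pipeline. This is a minimal illustration under stated assumptions: `denoise` is a stand-in for a real diffusion denoiser, the 2× upsampling, chunk size of 8, anchor size of 2, and step counts are all hypothetical choices for clarity, not values from the paper.

```python
import numpy as np

CHUNK, ANCHOR = 8, 2  # frames per chunk, anchor frames kept in the cache

def denoise(latents, steps, cond=None):
    """Stand-in for a diffusion denoiser; a real model would iterate
    `steps` times, optionally conditioned on cached anchor features."""
    return latents * 0.0  # pretend we reached the clean latent

def generate(video_lowres: np.ndarray) -> np.ndarray:
    """video_lowres: (T, H, W) noisy low-res latents -> refined high-res."""
    T = video_lowres.shape[0]
    coarse = denoise(video_lowres, steps=50)       # 1) cheap low-res pass
    cache = None                                   # fixed-size anchor cache
    out = []
    for start in range(0, T, CHUNK):               # 3) chunk-by-chunk streaming
        # 2x nearest-neighbor upsample of the cached low-res result
        chunk = np.repeat(np.repeat(coarse[start:start + CHUNK], 2, 1), 2, 2)
        steps = 50 if cache is None else 20        # 4) timestep compression
        refined = denoise(chunk, steps=steps, cond=cache)  # condition on cache
        cache = refined[-ANCHOR:]                  # keep only ANCHOR frames
        out.append(refined)
    return np.concatenate(out, axis=0)
```

Because the cache holds a constant number of anchor frames, per-chunk cost does not grow with clip length, matching the constant inference speed claimed above.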
Results & Findings
| Model | Resolution | FVD ↓ | LPIPS ↓ | Speedup vs. Wan2.1 |
|---|---|---|---|---|
| Wan2.1 (baseline) | 1080p | 210 | 0.31 | 1× |
| HiStream (spatial + temporal) | 1080p | 188 | 0.28 | ≈ 76× |
| HiStream+ (+ timestep compression) | 1080p | 200 | 0.30 | ≈ 107× |
- Visual quality: User studies showed >85 % preference for HiStream over the baseline, despite the massive speed gain.
- Scalability: Inference time stays roughly constant as video length grows, thanks to the fixed‑size anchor cache.
- Memory footprint: The caching strategy reduces GPU memory usage by ~40 % compared with naïve full‑resolution diffusion.
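The bounded-memory behavior follows directly from the cache design. A minimal sketch, assuming frames can be arbitrary objects and using a `deque` with `maxlen` as the eviction mechanism (the class name and cache size of 2 are illustrative, not from the paper):

```python
from collections import deque

class AnchorCache:
    """Fixed-capacity store of the most recent anchor frames."""

    def __init__(self, max_frames: int = 2):
        # deque with maxlen evicts the oldest entry automatically
        self._frames = deque(maxlen=max_frames)

    def push(self, frame) -> None:
        self._frames.append(frame)

    def anchors(self) -> list:
        return list(self._frames)

cache = AnchorCache(max_frames=2)
for i in range(100):          # stream 100 chunks through the cache
    cache.push(f"frame-{i}")
print(cache.anchors())        # only the 2 most recent anchors survive
```

However long the stream runs, the cache never holds more than `max_frames` entries, which is what keeps both memory and per-chunk conditioning cost constant.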
Practical Implications
- Content creation pipelines – Studios and indie developers can now generate 1080p (or higher) video on a single GPU in minutes rather than hours, opening doors for rapid prototyping, AI‑assisted VFX, and on‑the‑fly video synthesis in games.
- Real‑time or near‑real‑time applications – The streaming nature of HiStream makes it suitable for interactive tools (e.g., AI‑driven video editors, live‑stream overlays) where latency is critical.
- Edge deployment – Because the heavy diffusion work is done at low resolution and the high‑res refinement reuses cached features, the approach can be split across devices (e.g., low‑res on a server, high‑res upsampling on a local workstation).
- Cost reduction – The 70‑100× speedup translates directly into lower cloud GPU bills, making large‑scale video generation economically viable for SaaS platforms.
Limitations & Future Work
- Quality trade‑off in HiStream+ – The additional timestep compression causes a small but measurable drop in perceptual metrics; fine‑tuning the schedule per domain may be required.
- Cache size vs. temporal fidelity – A fixed anchor cache works well for moderate motion but may struggle with very fast or highly dynamic scenes; adaptive cache sizing could improve robustness.
- Generalization to ultra‑high resolutions (4K/8K) – The authors note that the current spatial compression pipeline still incurs memory spikes at extreme resolutions, suggesting the need for hierarchical or multi‑scale diffusion strategies.
- Broader modality testing – Experiments focus on natural video; extending to animation, medical imaging, or synthetic data streams remains an open avenue.
Overall, HiStream marks a significant step toward making high‑resolution video diffusion practical for developers and industry, while leaving clear pathways for further refinement and broader adoption.
Authors
- Haonan Qiu
- Shikun Liu
- Zijian Zhou
- Zhaochong An
- Weiming Ren
- Zhiheng Liu
- Jonas Schult
- Sen He
- Shoufa Chen
- Yuren Cong
- Tao Xiang
- Ziwei Liu
- Juan‑Manuel Perez‑Rua
Paper Information
- arXiv ID: 2512.21338v1
- Categories: cs.CV
- Published: December 24, 2025