[Paper] HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
Source: arXiv - 2603.08703v1
Overview
The paper HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising tackles a long‑standing problem in video synthesis: how to generate arbitrarily long videos with consistent motion and high visual quality without the quality collapse that typically occurs in autoregressive diffusion models. By rethinking when and how context frames are denoised, the authors introduce a hierarchical denoising scheme that both speeds up inference and dramatically reduces temporal drift.
Key Contributions
- Same‑noise‑level conditioning: Shows that conditioning on context frames at the same diffusion noise level as the current block is enough for temporal coherence, eliminating the need for fully‑denoised (high‑certainty) context that propagates errors.
- Hierarchical autoregressive (HiAR) framework: Reverses the classic generation order—rather than finishing one block before moving to the next, HiAR denoises all blocks in parallel at each diffusion step, keeping every block’s context at the same noise level.
- Pipelined parallel inference: The hierarchical design naturally enables a pipelined execution that yields a ~1.8× wall‑clock speedup in a 4‑step diffusion schedule.
- Forward‑KL regularizer for motion diversity: Introduces a bidirectional‑attention forward‑KL term that counteracts the low‑motion shortcut induced by the reverse‑KL (mode‑seeking) objective during self‑rollout distillation.
- State‑of‑the‑art results on VBench: Achieves the highest overall VBench score for 20‑second video generation and the lowest temporal drift among all evaluated methods.
Methodology
- Autoregressive diffusion recap – Traditional AR diffusion generates a video block by block, always conditioning on the fully denoised previous blocks. This high‑certainty context makes the model overly confident in its past predictions, so any mistake quickly snowballs.
- Key insight: same‑noise‑level context – Borrowing from bidirectional diffusion (where forward and backward passes share a common noise level), the authors argue that a noisy context provides just enough signal for continuity while keeping uncertainty high, which naturally dampens error accumulation.
- Hierarchical denoising schedule – The video is split into several temporal blocks (e.g., 4‑second chunks). At each diffusion step t (from high noise to low noise), all blocks are denoised one step forward simultaneously, so each block sees its neighboring context at the identical noise level t.
- Parallel pipeline – Because each denoising step touches every block, the computation can be pipelined across GPUs or CPU cores: once block 1 finishes its update at step t and moves on to step t−1, block 2 can still be completing step t, and so on. This yields the reported 1.8× speedup without sacrificing quality.
- Self‑rollout distillation + forward‑KL regularizer – To further improve long‑range consistency, the model is distilled from its own rollouts (teacher‑student training). The reverse‑KL loss alone encourages the model to “play it safe” and generate low‑motion videos; the added forward‑KL term, computed with a bidirectional attention mask, explicitly rewards diverse motion patterns, balancing the two objectives.
- Training details – The authors train on standard video diffusion datasets, use a 4‑step denoising schedule (much shorter than typical 100‑step diffusion), and adopt classifier‑free guidance for controllability.
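As a minimal sketch (toy arrays and a stand‑in denoiser, not the paper's actual network), the hierarchical schedule above can be written as a loop over diffusion steps in which every block advances together, so each block's context stays at the same noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_BLOCKS, NUM_STEPS = 4, 4   # temporal blocks; short 4-step schedule
FRAMES, DIM = 8, 16            # toy per-block shape

def denoise_step(block, context, t):
    # Stand-in for the real denoiser: nudge the block toward its left
    # neighbor's state, which is at the SAME noise level t.
    target = context if context is not None else np.zeros_like(block)
    return block + 0.5 * (target - block) / (t + 1)

blocks = [rng.normal(size=(FRAMES, DIM)) for _ in range(NUM_BLOCKS)]
for t in reversed(range(NUM_STEPS)):       # high noise -> low noise
    prev = [None] + blocks[:-1]            # left-neighbor context, pre-update
    blocks = [denoise_step(b, c, t) for b, c in zip(blocks, prev)]

# Pipelining intuition: a sequential AR schedule costs B * S block-updates
# end to end; staggering blocks across devices hides all but B + S - 1
# stage-times, an ideal speedup of B*S / (B + S - 1) = 16/7 ≈ 2.3x for
# B = S = 4, so a measured ~1.8x is plausible after overheads.
```

Note that because `prev` is captured before the list comprehension updates `blocks`, every block really does condition on context at noise level t, not t−1.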
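The role of the forward‑KL regularizer can be illustrated with toy discrete distributions (not the paper's actual loss): against a two‑mode teacher, reverse KL penalizes a collapsed, low‑motion student only by a bounded amount, while forward KL grows without bound as the collapse sharpens.

```python
import numpy as np

def forward_kl(p, q):
    # D_KL(p || q): mass-covering; large when q misses modes of p.
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(p, q):
    # D_KL(q || p): mode-seeking; stays small if q sits on one mode of p.
    return float(np.sum(q * np.log(q / p)))

teacher = np.array([0.5, 0.5])      # two equally likely motion modes
collapsed = np.array([0.98, 0.02])  # student stuck on one mode

# Reverse KL gives collapse only a mild penalty (bounded by ln 2 here
# as q -> a single mode), while forward KL punishes it much harder.
print(reverse_kl(teacher, collapsed))  # ≈ 0.60
print(forward_kl(teacher, collapsed))  # ≈ 1.27
```

How the two terms are weighted against each other during self‑rollout distillation is a detail of the paper's training recipe and is not sketched here.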
Results & Findings
| Metric (VBench, 20 s) | HiAR (4‑step) | Prior AR Diffusion | Other SOTA |
|---|---|---|---|
| Overall Score | 0.78 (best) | 0.71 | 0.73‑0.75 |
| Temporal Drift (lower is better) | 0.12 (lowest) | 0.21 | 0.18‑0.20 |
| Inference Time (wall‑clock) | 1.8× faster than baseline 4‑step AR | – | – |
- Temporal coherence: The same‑noise‑level conditioning cuts the drift by ~40 % compared with the strongest baseline.
- Speed: With only four diffusion steps, HiAR reaches near‑real‑time generation for 20‑second clips, a dramatic improvement over the 50‑100 steps typical in diffusion video models.
- Motion diversity: Removing the forward‑KL regularizer in ablations causes a noticeable drop in motion variance (the model collapses toward near‑static frames), confirming its role in preserving dynamics.
Practical Implications
- Long‑form video generation for content creators: Developers can now generate minutes‑long clips with consistent motion using far fewer diffusion steps, making on‑device or cloud‑based services more feasible.
- Real‑time video augmentation: The pipelined inference design fits well with streaming pipelines (e.g., AR/VR overlays, live broadcast graphics) where latency is critical.
- Game asset synthesis: Game studios can employ HiAR to produce procedural cutscenes or background loops without worrying about drift over long durations.
- Efficient fine‑tuning: Because the model works with a short diffusion schedule, fine‑tuning on domain‑specific video data (e.g., medical imaging, industrial inspection) becomes computationally cheaper.
- API design: The hierarchical block interface maps naturally to chunked video APIs, allowing developers to request “next N seconds” while the backend continues denoising earlier chunks in parallel.
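For the chunked‑API point above, a hypothetical "next N seconds" interface (names, shapes, and defaults invented for illustration) might look like a generator that yields finished chunks while, in a real backend, later chunks keep denoising in parallel:

```python
from typing import Iterator
import numpy as np

def generate_chunks(total_seconds: int, chunk_seconds: int = 4,
                    fps: int = 8, dim: int = 16) -> Iterator[np.ndarray]:
    # Hypothetical interface: each yield hands the caller one finished
    # chunk; a real backend would keep denoising later chunks meanwhile.
    rng = np.random.default_rng(0)
    for start in range(0, total_seconds, chunk_seconds):
        n = min(chunk_seconds, total_seconds - start)
        yield rng.normal(size=(n * fps, dim))  # placeholder frames x features

chunks = list(generate_chunks(total_seconds=20))  # 20 s -> 5 chunks of 32 frames
```

The generator shape mirrors the hierarchical block structure: the caller consumes fixed‑length chunks in order, which is exactly the granularity at which HiAR's schedule produces them.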
Limitations & Future Work
- Fixed block granularity: The current hierarchy assumes a uniform block size; adapting block lengths on‑the‑fly (e.g., to handle scene cuts) remains an open challenge.
- Four‑step schedule trade‑off: While 4 steps are fast, extremely high‑resolution or high‑frame‑rate videos may still benefit from more steps; scaling the hierarchical approach to longer schedules needs investigation.
- Forward‑KL computation cost: The bidirectional attention required for the forward‑KL regularizer adds memory overhead, which could be limiting on edge devices.
- Generalization to multimodal conditioning: The paper focuses on unconditional generation; extending HiAR to text‑to‑video or audio‑driven generation is a natural next step.
Overall, HiAR presents a compelling blend of algorithmic insight and engineering pragmatism, pushing autoregressive video diffusion toward real‑world deployment.
Authors
- Kai Zou
- Dian Zheng
- Hongbo Liu
- Tiankai Hang
- Bin Liu
- Nenghai Yu
Paper Information
- arXiv ID: 2603.08703v1
- Categories: cs.CV
- Published: March 9, 2026