[Paper] HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
Source: arXiv - 2603.08703v1
Overview
The paper HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising tackles a long‑standing problem in video synthesis: how to generate arbitrarily long videos with consistent motion and high visual quality without the quality collapse that typically occurs in autoregressive diffusion models. By rethinking when and how context frames are denoised, the authors introduce a hierarchical denoising scheme that both speeds up inference and dramatically reduces temporal drift.
Key Contributions
- Same‑noise‑level conditioning: Shows that conditioning on context frames at the same diffusion noise level as the current block is enough for temporal coherence, eliminating the need for fully‑denoised (high‑certainty) context that propagates errors.
- Hierarchical autoregressive (HiAR) framework: Reverses the classic generation order—rather than finishing one block before moving to the next, HiAR denoises all blocks in parallel at each diffusion step, keeping every block’s context at the same noise level.
- Pipelined parallel inference: The hierarchical design naturally enables a pipelined execution that yields a ~1.8× wall‑clock speedup in a 4‑step diffusion schedule.
- Forward‑KL regularizer for motion diversity: Introduces a bidirectional‑attention forward‑KL term that counteracts the low‑motion shortcut induced by the reverse‑KL (mode‑seeking) objective during self‑rollout distillation.
- State‑of‑the‑art results on VBench: Achieves the highest overall VBench score for 20‑second video generation and the lowest temporal drift among all evaluated methods.
Methodology
- Autoregressive diffusion recap – Traditional AR diffusion generates a video block by block, always conditioning on the fully denoised previous blocks. This high‑certainty context makes the model overly confident in its past predictions, so any mistake quickly snowballs.
- Key insight: same‑noise‑level context – Borrowing from bidirectional diffusion (where forward and backward passes share a common noise level), the authors argue that a noisy context provides just enough signal for continuity while keeping uncertainty high, which naturally dampens error accumulation.
- Hierarchical denoising schedule – The video is split into several temporal blocks (e.g., 4‑second chunks). At each diffusion step t (from high noise to low noise), all blocks are denoised one step forward simultaneously, so each block sees its neighboring context at the identical noise level t.
- Parallel pipeline – Because each denoising step touches every block, the computation can be pipelined across GPUs or CPU cores: once block 1 finishes its update at step t and moves on to step t−1, block 2 can still be completing step t, and so on. This yields the reported 1.8× speedup without sacrificing quality.
- Self‑rollout distillation + forward‑KL regularizer – To further improve long‑range consistency, the model is distilled from its own rollouts (teacher‑student training). The reverse‑KL loss alone encourages the model to “play it safe” and generate low‑motion videos; the added forward‑KL term, computed with a bidirectional attention mask, explicitly rewards diverse motion patterns, balancing the two objectives.
- Training details – The authors train on standard video diffusion datasets, use a 4‑step denoising schedule (much shorter than typical 100‑step diffusion), and adopt classifier‑free guidance for controllability.
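As a minimal sketch (toy arrays and a stand‑in denoiser, not the paper's actual network), the hierarchical schedule above can be written as a loop over diffusion steps in which every block advances together, so each block's context stays at the same noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_BLOCKS, NUM_STEPS = 4, 4   # temporal blocks; short 4-step schedule
FRAMES, DIM = 8, 16            # toy per-block shape

def denoise_step(block, context, t):
    # Stand-in for the real denoiser: nudge the block toward its left
    # neighbor's state, which is at the SAME noise level t.
    target = context if context is not None else np.zeros_like(block)
    return block + 0.5 * (target - block) / (t + 1)

blocks = [rng.normal(size=(FRAMES, DIM)) for _ in range(NUM_BLOCKS)]
for t in reversed(range(NUM_STEPS)):       # high noise -> low noise
    prev = [None] + blocks[:-1]            # left-neighbor context, pre-update
    blocks = [denoise_step(b, c, t) for b, c in zip(blocks, prev)]

# Pipelining intuition: a sequential AR schedule costs B * S block-updates
# end to end; staggering blocks across devices hides all but B + S - 1
# stage-times, an ideal speedup of B*S / (B + S - 1) = 16/7 ≈ 2.3x for
# B = S = 4, so a measured ~1.8x is plausible after overheads.
```

Note that because `prev` is captured before the list comprehension updates `blocks`, every block really does condition on context at noise level t, not t−1.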
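The role of the forward‑KL regularizer can be illustrated with toy discrete distributions (not the paper's actual loss): against a two‑mode teacher, reverse KL penalizes a collapsed, low‑motion student only by a bounded amount, while forward KL grows without bound as the collapse sharpens.

```python
import numpy as np

def forward_kl(p, q):
    # D_KL(p || q): mass-covering; large when q misses modes of p.
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(p, q):
    # D_KL(q || p): mode-seeking; stays small if q sits on one mode of p.
    return float(np.sum(q * np.log(q / p)))

teacher = np.array([0.5, 0.5])      # two equally likely motion modes
collapsed = np.array([0.98, 0.02])  # student stuck on one mode

# Reverse KL gives collapse only a mild penalty (bounded by ln 2 here
# as q -> a single mode), while forward KL punishes it much harder.
print(reverse_kl(teacher, collapsed))  # ≈ 0.60
print(forward_kl(teacher, collapsed))  # ≈ 1.27
```

How the two terms are weighted against each other during self‑rollout distillation is a detail of the paper's training recipe and is not sketched here.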
Results & Findings
| Metric (VBench, 20 s) | HiAR (4‑step) | Prior AR Diffusion | Other SOTA |
|---|---|---|---|
| Overall Score | 0.78 (best) | 0.71 | 0.73‑0.75 |
| Temporal Drift (lower is better) | 0.12 (lowest) | 0.21 | 0.18‑0.20 |
| Inference Time (wall‑clock) | 1.8× faster than baseline 4‑step AR | – | – |
- Temporal coherence: The same‑noise‑level conditioning cuts the drift by ~40 % compared with the strongest baseline.
- Speed: With only four diffusion steps, HiAR reaches near‑real‑time generation for 20‑second clips, a dramatic improvement over the 50‑100 steps typical in diffusion video models.
- Motion diversity: Removing the forward‑KL regularizer in ablations causes a noticeable drop in motion variance (the model collapses toward near‑static frames), confirming its role in preserving dynamics.
Practical Implications
- Long‑form video generation for content creators: Developers can now generate minutes‑long clips with consistent motion using far fewer diffusion steps, making on‑device or cloud‑based services more feasible.
- Real‑time video augmentation: The pipelined inference design fits well with streaming pipelines (e.g., AR/VR overlays, live broadcast graphics) where latency is critical.
- Game asset synthesis: Game studios can employ HiAR to produce procedural cutscenes or background loops without worrying about drift over long durations.
- Efficient fine‑tuning: Because the model works with a short diffusion schedule, fine‑tuning on domain‑specific video data (e.g., medical imaging, industrial inspection) becomes computationally cheaper.
- API design: The hierarchical block interface maps naturally to chunked video APIs, allowing developers to request “next N seconds” while the backend continues denoising earlier chunks in parallel.
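For the chunked‑API point above, a hypothetical "next N seconds" interface (names, shapes, and defaults invented for illustration) might look like a generator that yields finished chunks while, in a real backend, later chunks keep denoising in parallel:

```python
from typing import Iterator
import numpy as np

def generate_chunks(total_seconds: int, chunk_seconds: int = 4,
                    fps: int = 8, dim: int = 16) -> Iterator[np.ndarray]:
    # Hypothetical interface: each yield hands the caller one finished
    # chunk; a real backend would keep denoising later chunks meanwhile.
    rng = np.random.default_rng(0)
    for start in range(0, total_seconds, chunk_seconds):
        n = min(chunk_seconds, total_seconds - start)
        yield rng.normal(size=(n * fps, dim))  # placeholder frames x features

chunks = list(generate_chunks(total_seconds=20))  # 20 s -> 5 chunks of 32 frames
```

The generator shape mirrors the hierarchical block structure: the caller consumes fixed‑length chunks in order, which is exactly the granularity at which HiAR's schedule produces them.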
Limitations & Future Work
- Fixed block granularity: The current hierarchy assumes a uniform block size; adapting block lengths on‑the‑fly (e.g., to handle scene cuts) remains an open challenge.
- Four‑step schedule trade‑off: While 4 steps are fast, extremely high‑resolution or high‑frame‑rate videos may still benefit from more steps; scaling the hierarchical approach to longer schedules needs investigation.
- Forward‑KL computation cost: The bidirectional attention required for the forward‑KL regularizer adds memory overhead, which could be limiting on edge devices.
- Generalization to multimodal conditioning: The paper focuses on unconditional generation; extending HiAR to text‑to‑video or audio‑driven generation is a natural next step.
Overall, HiAR presents a compelling blend of algorithmic insight and engineering pragmatism, pushing autoregressive video diffusion toward real‑world deployment.
Authors
- Kai Zou
- Dian Zheng
- Hongbo Liu
- Tiankai Hang
- Bin Liu
- Nenghai Yu
Paper Information
- arXiv ID: 2603.08703v1
- Categories: cs.CV
- Published: March 9, 2026