[Paper] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
Source: arXiv - 2601.05966v1
Overview
The paper presents VideoAR, a new autoregressive framework for generating videos that rivals the quality of diffusion‑based models while being far more efficient. By combining multi‑scale next‑frame prediction with a visual autoregressive tokenizer, the authors achieve high‑fidelity, temporally coherent video synthesis with dramatically fewer inference steps.
Key Contributions
- First large‑scale Visual Autoregressive (VAR) video model that jointly handles spatial and temporal dependencies.
- 3‑D multi‑scale tokenizer that compresses spatio‑temporal information into discrete tokens, enabling fast autoregressive decoding.
- Temporal stability tricks: Multi‑scale Temporal RoPE, Cross‑Frame Error Correction, and Random Frame Mask to curb error accumulation over long sequences.
- Multi‑stage pre‑training pipeline that progressively scales resolution and duration, aligning spatial and temporal learning.
- State‑of‑the‑art results for autoregressive video generation: FVD 88.6 on UCF‑101 (vs. 99.5 previously) and VBench 81.74, with >10× fewer inference steps than diffusion baselines.
Methodology
- Tokenization – A 3‑D tokenizer slices a video into a hierarchy of discrete tokens at multiple spatial scales (e.g., 8×8, 16×16 patches) and temporal strides. This compact representation captures both appearance and motion while keeping the sequence length manageable (a minimal coarse‑to‑fine quantization sketch follows this list).
- Autoregressive Modeling – The model treats video generation as a two‑fold problem:
  - Intra‑frame VAR: predicts the next token scale within the current frame, preserving spatial structure.
  - Causal next‑frame prediction: forecasts the token set for the upcoming frame, ensuring temporal causality.
- Temporal RoPE & Error Correction – Rotary Positional Embeddings (RoPE) are extended across scales to encode relative time, and a lightweight cross‑frame error‑correction module revisits earlier predictions to fix drift (see the temporal‑RoPE sketch after this list).
- Training Regimen – A staged curriculum starts with low‑resolution, short‑clip videos, then gradually increases resolution and clip length. Random frame masking forces the model to learn robust reconstruction, further reducing error propagation (see the frame‑masking sketch after this list).
- Inference – Generation proceeds frame by frame and scale by scale, decoding whole token blocks rather than one token at a time, so only a small number of decoding steps (≈90 in the reported setup) are needed to produce a full‑length video (a schematic decoding loop follows this list).
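To make the coarse‑to‑fine tokenization concrete, here is a minimal PyTorch sketch of multi‑scale residual quantization over a video latent. The scale grid, codebook shape, and plain nearest‑neighbour VQ are illustrative assumptions, not the paper's actual 3‑D tokenizer.

```python
# Minimal sketch: quantize a video latent coarse-to-fine, each scale encoding
# the residual left by the previous one (illustrative shapes and codebook).
import torch
import torch.nn.functional as F

def multiscale_tokenize(latent, codebook, scales=((1, 8, 8), (2, 16, 16))):
    """latent: (B, C, T, H, W) float tensor; codebook: (K, C) float tensor."""
    tokens_per_scale = []
    residual = latent
    for (t, h, w) in scales:
        # Pool the residual down to this scale's spatio-temporal grid.
        coarse = F.adaptive_avg_pool3d(residual, (t, h, w))                 # (B, C, t, h, w)
        flat = coarse.permute(0, 2, 3, 4, 1).reshape(-1, coarse.shape[1])   # (B*t*h*w, C)
        # Assign each cell to its nearest codebook entry (plain VQ).
        ids = torch.cdist(flat, codebook).argmin(dim=-1)
        tokens_per_scale.append(ids.view(coarse.shape[0], t, h, w))
        # Reconstruct this scale, upsample, and subtract it from the residual
        # so the next (finer) scale only has to encode what is still missing.
        recon = codebook[ids].view(coarse.shape[0], t, h, w, -1).permute(0, 4, 1, 2, 3)
        residual = residual - F.interpolate(recon, size=residual.shape[2:], mode="trilinear")
    return tokens_per_scale

# Illustrative usage: a 16-channel latent of 4 frames at 32x32 with an 8192-entry codebook.
codebook = torch.randn(8192, 16)
tokens = multiscale_tokenize(torch.randn(1, 16, 4, 32, 32), codebook)
```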
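The Multi‑scale Temporal RoPE can be pictured as rotating each token's features by an angle tied to its frame index, so tokens from different scales of the same frame share one temporal position. The function below is a generic rotary‑embedding sketch under that assumption, not the paper's exact indexing scheme.

```python
# Generic rotary embedding over frame indices (illustrative; the paper's
# multi-scale indexing may differ).
import torch

def temporal_rope(x, frame_ids, base=10000.0):
    """x: (B, N, D) features with even D; frame_ids: (B, N) frame index per token."""
    d_half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d_half, dtype=x.dtype) / d_half)   # (D/2,) decaying frequencies
    angles = frame_ids.unsqueeze(-1).to(x.dtype) * freqs              # (B, N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :d_half], x[..., d_half:]
    # Standard rotary pairing: rotate (x1, x2) by the per-position angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative usage: 256 tokens, 64 per frame, so four frames map to four angles.
x = torch.randn(1, 256, 64)
frame_ids = (torch.arange(256) // 64).unsqueeze(0)
x_rotated = temporal_rope(x, frame_ids)
```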
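Random Frame Mask can be sketched as replacing every token of a randomly chosen subset of frames with a mask id before computing the reconstruction loss; the masking ratio and mask‑token mechanics below are assumptions, not the paper's recipe.

```python
# Replace all tokens of randomly selected frames with a mask id (illustrative).
import torch

def random_frame_mask(frame_tokens, mask_id, p=0.15, generator=None):
    """frame_tokens: (B, T, N) integer token ids, N tokens per frame."""
    b, t, _ = frame_tokens.shape
    dropped = torch.rand(b, t, generator=generator) < p               # (B, T), True = masked frame
    masked = frame_tokens.masked_fill(dropped.unsqueeze(-1), mask_id)
    return masked, dropped

# Illustrative usage: mask roughly 15% of 8 frames, each holding 64 tokens.
tokens = torch.randint(0, 8192, (2, 8, 64))
masked_tokens, dropped_frames = random_frame_mask(tokens, mask_id=8192)
```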
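Finally, the decoding loop is frame‑by‑frame on the outside and coarse‑to‑fine on the inside, which is where the small step count comes from: each scale of each frame is emitted in one pass. The control flow below is schematic; `model.predict_scale` and `detokenize` are hypothetical interfaces standing in for the real model and tokenizer decoder.

```python
# Schematic decoding loop (hypothetical `model.predict_scale` / `detokenize` APIs).
import torch

@torch.no_grad()
def generate_video(model, detokenize, num_frames, scales, prompt_tokens=None):
    history = [] if prompt_tokens is None else list(prompt_tokens)   # tokens of past frames
    frames = []
    for f in range(num_frames):
        frame_tokens = []
        for scale in scales:
            # One pass per scale: predict this scale's token block conditioned on
            # earlier frames (causality) and the coarser scales of this frame.
            ids = model.predict_scale(history, frame_tokens, frame_idx=f, scale=scale)
            frame_tokens.append(ids)
        history.append(frame_tokens)
        frames.append(detokenize(frame_tokens))                      # (B, C, H, W) per frame
    return torch.stack(frames, dim=1)                                # (B, T, C, H, W)
```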
Results & Findings
| Metric | Prior Autoregressive | VideoAR | Diffusion (large) |
|---|---|---|---|
| FVD (UCF‑101) | 99.5 | 88.6 | ~85 |
| VBench Score | 73.2 | 81.74 | 82–84 |
| Inference Steps | ~1000 | ~90 | ~1000+ |
| Relative Compute (GPU‑hrs) | 1.2× | 0.8× | 1.0× (larger model) |
- VideoAR closes the quality gap with diffusion models while cutting inference time by more than an order of magnitude.
- The introduced temporal mechanisms significantly reduce flickering and drift, yielding smoother long‑range motion.
- Ablation studies confirm that each component (Multi‑scale RoPE, Error Correction, Random Mask) contributes measurable gains in FVD and VBench.
Practical Implications
- Faster Prototyping – Developers can generate high‑quality video samples in seconds on a single GPU, enabling rapid iteration for content creation, game asset pipelines, or synthetic data generation.
- Scalable Deployment – The token‑based autoregressive design fits well with existing transformer serving stacks (e.g., ONNX, TensorRT), making it easier to integrate into production services than memory‑heavy diffusion pipelines (see the export sketch after this list).
- Temporal Consistency – Applications that require coherent motion—such as virtual avatars, video‑to‑video translation, or training data for video‑based perception models—benefit from the reduced error propagation.
- Resource‑Constrained Environments – Because inference is lightweight, VideoAR can run on edge devices or cloud‑cost‑optimized instances, opening doors for real‑time video synthesis in AR/VR or live streaming contexts.
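On the deployment point, a token‑based decoder step goes through the standard `torch.onnx.export` path like any other transformer. The stand‑in module below is purely illustrative and implies nothing about an official VideoAR export recipe or released weights.

```python
# Export a toy decoder step to ONNX (stand-in module; not the VideoAR architecture).
import torch

class DecoderStep(torch.nn.Module):
    def __init__(self, vocab=8192, dim=512):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.block = torch.nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, token_ids):                      # (B, N) ids -> (B, N, vocab) logits
        return self.head(self.block(self.embed(token_ids)))

step = DecoderStep().eval()
dummy = torch.randint(0, 8192, (1, 256))
torch.onnx.export(
    step, (dummy,), "videoar_step.onnx",
    input_names=["token_ids"], output_names=["logits"],
    dynamic_axes={"token_ids": {0: "batch", 1: "tokens"}},
    opset_version=17,
)
```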
Limitations & Future Work
- Resolution Ceiling – While the multi‑scale tokenizer helps, generating ultra‑high‑definition (4K+) videos still strains the token budget and may require further hierarchical designs.
- Long‑Term Dependencies – Although temporal RoPE and correction mitigate drift, very long clips (>10 seconds) can still exhibit subtle inconsistencies.
- Domain Generalization – The model is evaluated mainly on action‑recognition datasets (UCF‑101, Kinetics); adapting it to highly specialized domains (medical imaging, scientific visualization) may need domain‑specific pre‑training.
- Future Directions – The authors suggest exploring hybrid autoregressive‑diffusion schemes, richer conditioning (text, audio), and more aggressive token compression to push both quality and speed further.
Authors
- Longbin Ji
- Xiaoxiong Liu
- Junyuan Shang
- Shuohuan Wang
- Yu Sun
- Hua Wu
- Haifeng Wang
Paper Information
- arXiv ID: 2601.05966v1
- Categories: cs.CV, cs.AI
- Published: January 9, 2026