[Paper] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
Source: arXiv - 2601.05966v1
Overview
The paper presents VideoAR, a new autoregressive framework for generating videos that rivals the quality of diffusion‑based models while being far more efficient. By combining multi‑scale next‑frame prediction with a visual autoregressive tokenizer, the authors achieve high‑fidelity, temporally coherent video synthesis with dramatically fewer inference steps.
Key Contributions
- First large‑scale Visual Autoregressive (VAR) video model that jointly handles spatial and temporal dependencies.
- 3‑D multi‑scale tokenizer that compresses spatio‑temporal information into discrete tokens, enabling fast autoregressive decoding.
- Temporal stability tricks: Multi‑scale Temporal RoPE, Cross‑Frame Error Correction, and Random Frame Mask to curb error accumulation over long sequences.
- Multi‑stage pre‑training pipeline that progressively scales resolution and duration, aligning spatial and temporal learning.
- State‑of‑the‑art results for autoregressive video generation: FVD 88.6 on UCF‑101 (vs. 99.5 previously) and VBench 81.74, with >10× fewer inference steps than diffusion baselines.
Methodology
- Tokenization – A 3‑D tokenizer slices a video into a hierarchy of discrete tokens at multiple spatial scales (e.g., 8×8, 16×16 patches) and temporal strides. This compact representation captures both appearance and motion while keeping the sequence length manageable (a minimal coarse‑to‑fine quantization sketch follows this list).
- Autoregressive Modeling – The model treats video generation as a two‑fold problem:
  - Intra‑frame VAR: predicts the next token scale within the current frame, preserving spatial structure.
  - Causal next‑frame prediction: forecasts the token set for the upcoming frame, ensuring temporal causality.
- Temporal RoPE & Error Correction – Rotary Positional Embeddings (RoPE) are extended across scales to encode relative time, and a lightweight cross‑frame error‑correction module revisits earlier predictions to fix drift (see the temporal‑RoPE sketch after this list).
- Training Regimen – A staged curriculum starts with low‑resolution, short‑clip videos, then gradually increases resolution and clip length. Random frame masking forces the model to learn robust reconstruction, further reducing error propagation (see the frame‑masking sketch after this list).
- Inference – Generation proceeds frame by frame and scale by scale, decoding whole token blocks rather than one token at a time, so only a small number of decoding steps (≈90 in the reported setup) are needed to produce a full‑length video (a schematic decoding loop follows this list).
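To make the coarse‑to‑fine tokenization concrete, here is a minimal PyTorch sketch of multi‑scale residual quantization over a video latent. The scale grid, codebook shape, and plain nearest‑neighbour VQ are illustrative assumptions, not the paper's actual 3‑D tokenizer.

```python
# Minimal sketch: quantize a video latent coarse-to-fine, each scale encoding
# the residual left by the previous one (illustrative shapes and codebook).
import torch
import torch.nn.functional as F

def multiscale_tokenize(latent, codebook, scales=((1, 8, 8), (2, 16, 16))):
    """latent: (B, C, T, H, W) float tensor; codebook: (K, C) float tensor."""
    tokens_per_scale = []
    residual = latent
    for (t, h, w) in scales:
        # Pool the residual down to this scale's spatio-temporal grid.
        coarse = F.adaptive_avg_pool3d(residual, (t, h, w))                 # (B, C, t, h, w)
        flat = coarse.permute(0, 2, 3, 4, 1).reshape(-1, coarse.shape[1])   # (B*t*h*w, C)
        # Assign each cell to its nearest codebook entry (plain VQ).
        ids = torch.cdist(flat, codebook).argmin(dim=-1)
        tokens_per_scale.append(ids.view(coarse.shape[0], t, h, w))
        # Reconstruct this scale, upsample, and subtract it from the residual
        # so the next (finer) scale only has to encode what is still missing.
        recon = codebook[ids].view(coarse.shape[0], t, h, w, -1).permute(0, 4, 1, 2, 3)
        residual = residual - F.interpolate(recon, size=residual.shape[2:], mode="trilinear")
    return tokens_per_scale

# Illustrative usage: a 16-channel latent of 4 frames at 32x32 with an 8192-entry codebook.
codebook = torch.randn(8192, 16)
tokens = multiscale_tokenize(torch.randn(1, 16, 4, 32, 32), codebook)
```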
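The Multi‑scale Temporal RoPE can be pictured as rotating each token's features by an angle tied to its frame index, so tokens from different scales of the same frame share one temporal position. The function below is a generic rotary‑embedding sketch under that assumption, not the paper's exact indexing scheme.

```python
# Generic rotary embedding over frame indices (illustrative; the paper's
# multi-scale indexing may differ).
import torch

def temporal_rope(x, frame_ids, base=10000.0):
    """x: (B, N, D) features with even D; frame_ids: (B, N) frame index per token."""
    d_half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d_half, dtype=x.dtype) / d_half)   # (D/2,) decaying frequencies
    angles = frame_ids.unsqueeze(-1).to(x.dtype) * freqs              # (B, N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :d_half], x[..., d_half:]
    # Standard rotary pairing: rotate (x1, x2) by the per-position angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative usage: 256 tokens, 64 per frame, so four frames map to four angles.
x = torch.randn(1, 256, 64)
frame_ids = (torch.arange(256) // 64).unsqueeze(0)
x_rotated = temporal_rope(x, frame_ids)
```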
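Random Frame Mask can be sketched as replacing every token of a randomly chosen subset of frames with a mask id before computing the reconstruction loss; the masking ratio and mask‑token mechanics below are assumptions, not the paper's recipe.

```python
# Replace all tokens of randomly selected frames with a mask id (illustrative).
import torch

def random_frame_mask(frame_tokens, mask_id, p=0.15, generator=None):
    """frame_tokens: (B, T, N) integer token ids, N tokens per frame."""
    b, t, _ = frame_tokens.shape
    dropped = torch.rand(b, t, generator=generator) < p               # (B, T), True = masked frame
    masked = frame_tokens.masked_fill(dropped.unsqueeze(-1), mask_id)
    return masked, dropped

# Illustrative usage: mask roughly 15% of 8 frames, each holding 64 tokens.
tokens = torch.randint(0, 8192, (2, 8, 64))
masked_tokens, dropped_frames = random_frame_mask(tokens, mask_id=8192)
```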
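Finally, the decoding loop is frame‑by‑frame on the outside and coarse‑to‑fine on the inside, which is where the small step count comes from: each scale of each frame is emitted in one pass. The control flow below is schematic; `model.predict_scale` and `detokenize` are hypothetical interfaces standing in for the real model and tokenizer decoder.

```python
# Schematic decoding loop (hypothetical `model.predict_scale` / `detokenize` APIs).
import torch

@torch.no_grad()
def generate_video(model, detokenize, num_frames, scales, prompt_tokens=None):
    history = [] if prompt_tokens is None else list(prompt_tokens)   # tokens of past frames
    frames = []
    for f in range(num_frames):
        frame_tokens = []
        for scale in scales:
            # One pass per scale: predict this scale's token block conditioned on
            # earlier frames (causality) and the coarser scales of this frame.
            ids = model.predict_scale(history, frame_tokens, frame_idx=f, scale=scale)
            frame_tokens.append(ids)
        history.append(frame_tokens)
        frames.append(detokenize(frame_tokens))                      # (B, C, H, W) per frame
    return torch.stack(frames, dim=1)                                # (B, T, C, H, W)
```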
Results & Findings
| Metric | Prior Autoregressive | VideoAR | Diffusion (large) |
|---|---|---|---|
| FVD (UCF‑101) | 99.5 | 88.6 | ~85 |
| VBench Score | 73.2 | 81.74 | 82–84 |
| Inference Steps | ~1000 | ~90 | ~1000+ |
| Relative Compute (GPU‑hrs) | 1.2× | 0.8× | 1.0× (larger model) |
- VideoAR closes the quality gap with diffusion models while cutting inference time by more than an order of magnitude.
- The introduced temporal mechanisms significantly reduce flickering and drift, yielding smoother long‑range motion.
- Ablation studies confirm that each component (Multi‑scale RoPE, Error Correction, Random Mask) contributes measurable gains in FVD and VBench.
Practical Implications
- Faster Prototyping – Developers can generate high‑quality video samples in seconds on a single GPU, enabling rapid iteration for content creation, game asset pipelines, or synthetic data generation.
- Scalable Deployment – The token‑based autoregressive design fits well with existing transformer serving stacks (e.g., ONNX, TensorRT), making it easier to integrate into production services than memory‑heavy diffusion pipelines (see the export sketch after this list).
- Temporal Consistency – Applications that require coherent motion—such as virtual avatars, video‑to‑video translation, or training data for video‑based perception models—benefit from the reduced error propagation.
- Resource‑Constrained Environments – Because inference is lightweight, VideoAR can run on edge devices or cloud‑cost‑optimized instances, opening doors for real‑time video synthesis in AR/VR or live streaming contexts.
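On the deployment point, a token‑based decoder step goes through the standard `torch.onnx.export` path like any other transformer. The stand‑in module below is purely illustrative and implies nothing about an official VideoAR export recipe or released weights.

```python
# Export a toy decoder step to ONNX (stand-in module; not the VideoAR architecture).
import torch

class DecoderStep(torch.nn.Module):
    def __init__(self, vocab=8192, dim=512):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.block = torch.nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, token_ids):                      # (B, N) ids -> (B, N, vocab) logits
        return self.head(self.block(self.embed(token_ids)))

step = DecoderStep().eval()
dummy = torch.randint(0, 8192, (1, 256))
torch.onnx.export(
    step, (dummy,), "videoar_step.onnx",
    input_names=["token_ids"], output_names=["logits"],
    dynamic_axes={"token_ids": {0: "batch", 1: "tokens"}},
    opset_version=17,
)
```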
Limitations & Future Work
- Resolution Ceiling – While the multi‑scale tokenizer helps, generating ultra‑high‑definition (4K+) videos still strains the token budget and may require further hierarchical designs.
- Long‑Term Dependencies – Although temporal RoPE and correction mitigate drift, very long clips (>10 seconds) can still exhibit subtle inconsistencies.
- Domain Generalization – The model is evaluated mainly on action‑recognition datasets (UCF‑101, Kinetics); adapting it to highly specialized domains (medical imaging, scientific visualization) may need domain‑specific pre‑training.
- Future Directions – The authors suggest exploring hybrid autoregressive‑diffusion schemes, richer conditioning (text, audio), and more aggressive token compression to push both quality and speed further.
Authors
- Longbin Ji
- Xiaoxiong Liu
- Junyuan Shang
- Shuohuan Wang
- Yu Sun
- Hua Wu
- Haifeng Wang
Paper Information
- arXiv ID: 2601.05966v1
- Categories: cs.CV, cs.AI
- Published: January 9, 2026