[Paper] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Published: January 9, 2026 at 12:34 PM EST
3 min read
Source: arXiv - 2601.05966v1

Overview

The paper presents VideoAR, a new autoregressive framework for generating videos that rivals the quality of diffusion‑based models while being far more efficient. By combining multi‑scale next‑frame prediction with a visual autoregressive tokenizer, the authors achieve high‑fidelity, temporally coherent video synthesis with dramatically fewer inference steps.

Key Contributions

  • First large‑scale Visual Autoregressive (VAR) video model that jointly handles spatial and temporal dependencies.
  • 3‑D multi‑scale tokenizer that compresses spatio‑temporal information into discrete tokens, enabling fast autoregressive decoding.
  • Temporal stability tricks: Multi‑scale Temporal RoPE, Cross‑Frame Error Correction, and Random Frame Mask to curb error accumulation over long sequences.
  • Multi‑stage pre‑training pipeline that progressively scales resolution and duration, aligning spatial and temporal learning.
  • State‑of‑the‑art results for autoregressive video generation: FVD 88.6 on UCF‑101 (vs. 99.5 previously) and VBench 81.74, with >10× fewer inference steps than diffusion baselines.

Methodology

  1. Tokenization – A 3‑D tokenizer slices a video into a hierarchy of discrete tokens at multiple spatial scales (e.g., 8×8, 16×16 patches) and temporal strides. This compact representation captures both appearance and motion while keeping the sequence length manageable.

  2. Autoregressive Modeling – The model factorizes video generation into two complementary prediction tasks:

    • Intra‑frame VAR: predicts the next, finer scale of tokens within the current frame, preserving spatial structure.
    • Causal next‑frame prediction: forecasts the token set for the upcoming frame, ensuring temporal causality.
  3. Temporal RoPE & Error Correction – Rotary Positional Embeddings (RoPE) are extended across scales to encode relative time, and a lightweight cross‑frame error‑correction module revisits earlier predictions to correct drift (a toy rotation sketch follows this list).

  4. Training Regimen – A staged curriculum starts with low‑resolution, short‑clip videos, then gradually increases resolution and clip length. Random frame masking forces the model to learn robust reconstruction, further reducing error propagation.

  5. Inference – Generation proceeds scale‑by‑scale within each frame and frame‑by‑frame across time; because each scale's token grid is predicted as a block rather than one token at a time, only a handful of decoding steps are needed to produce a full‑length video (see the decoding sketch below).
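
To ground step 3, here is a toy sketch of rotary embeddings driven by a combined frame/scale index. The split between `frame_idx` and `scale_idx` and the mixing formula are illustrative assumptions, not the paper's exact Multi‑scale Temporal RoPE.

```python
import torch


def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings: x is (seq, dim), positions is (seq,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = positions[:, None].float() * freqs[None, :]               # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# Toy usage: give every token a position that mixes its frame index and its
# scale index, so attention can tell "same frame, finer scale" from "next frame".
x = torch.randn(6, 8)
frame_idx = torch.tensor([0, 0, 0, 1, 1, 1])   # which frame each token belongs to
scale_idx = torch.tensor([0, 1, 2, 0, 1, 2])   # which scale within that frame
positions = frame_idx * 4 + scale_idx          # hypothetical combined index
print(rope_rotate(x, positions).shape)         # torch.Size([6, 8])
```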
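
And to make step 5 concrete, a minimal decoding loop over scales and frames might look like the following. `DummyModel`, the scale schedule, and greedy sampling are hypothetical stand‑ins, not the paper's released code.

```python
import torch

VOCAB_SIZE = 4096                            # size of the discrete token codebook
SCALES = [(1, 1), (2, 2), (4, 4), (8, 8)]    # coarse-to-fine token grids per frame


def decode_video(model, num_frames):
    """Generate num_frames frames; each frame is a coarse-to-fine list of token grids."""
    frames = []
    context = []  # flat token history: earlier frames condition later ones (causality)
    for t in range(num_frames):
        frame_scales = []
        for h, w in SCALES:
            # One forward pass scores the whole h*w grid for this scale,
            # so a frame needs only len(SCALES) steps, not h*w steps.
            logits = model(context, scale=(h, w), frame=t)   # (h*w, VOCAB_SIZE)
            tokens = logits.argmax(dim=-1).reshape(h, w)     # greedy, for brevity
            frame_scales.append(tokens)
            context.append(tokens.flatten())
        frames.append(frame_scales)
    return frames


class DummyModel:
    """Stand-in scorer so the sketch runs end to end."""
    def __call__(self, context, scale, frame):
        h, w = scale
        return torch.randn(h * w, VOCAB_SIZE)


video = decode_video(DummyModel(), num_frames=4)
print([[grid.shape for grid in frame] for frame in video])
```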

Results & Findings

Metric                       Prior Autoregressive   VideoAR   Diffusion (large)
FVD (UCF‑101)                99.5                   88.6      ~85
VBench Score                 73.2                   81.74     82–84
Inference Steps              ~1000                  ≈90       ~1000+
Compute (relative GPU‑hrs)   1.2×                   0.8×      1.0× (larger model)

  • VideoAR closes the quality gap with diffusion models while using more than an order of magnitude fewer inference steps.
  • The introduced temporal mechanisms significantly reduce flickering and drift, yielding smoother long‑range motion.
  • Ablation studies confirm that each component (Multi‑scale RoPE, Error Correction, Random Mask) contributes measurable gains in FVD and VBench.

Practical Implications

  • Faster Prototyping – Developers can generate high‑quality video samples in seconds on a single GPU, enabling rapid iteration for content creation, game asset pipelines, or synthetic data generation.
  • Scalable Deployment – The token‑based autoregressive design fits well with existing transformer serving stacks (e.g., ONNX, TensorRT), making it easier to integrate into production services than memory‑heavy diffusion pipelines (a toy export sketch follows this list).
  • Temporal Consistency – Applications that require coherent motion—such as virtual avatars, video‑to‑video translation, or training data for video‑based perception models—benefit from the reduced error propagation.
  • Resource‑Constrained Environments – Because inference is lightweight, VideoAR can run on edge devices or cloud‑cost‑optimized instances, opening doors for real‑time video synthesis in AR/VR or live streaming contexts.
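
To illustrate the serving point above, here is a minimal, hypothetical sketch of exporting a single autoregressive decoding step to ONNX. `ToyARStep`, the vocabulary size, and all shapes are stand‑ins assuming a PyTorch implementation; the paper does not publish this interface.

```python
import torch
import torch.nn as nn


class ToyARStep(nn.Module):
    """One decoding step: a sequence of token ids in, next-token logits out."""
    def __init__(self, vocab_size: int = 4096, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Mean-pool the context embedding, then score the vocabulary.
        return self.proj(self.embed(tokens).mean(dim=1))   # (batch, vocab)


step = ToyARStep()
dummy = torch.randint(0, 4096, (1, 64))                    # (batch, token sequence)
torch.onnx.export(step, (dummy,), "videoar_step.onnx",
                  input_names=["tokens"], output_names=["logits"],
                  dynamic_axes={"tokens": {1: "seq"}})     # variable-length context
```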

Limitations & Future Work

  • Resolution Ceiling – While the multi‑scale tokenizer helps, generating ultra‑high‑definition (4K+) videos still strains the token budget and may require further hierarchical designs.
  • Long‑Term Dependencies – Although temporal RoPE and correction mitigate drift, very long clips (>10 seconds) can still exhibit subtle inconsistencies.
  • Domain Generalization – The model is primarily evaluated on action‑recognition datasets (UCF‑101, Kinetics). Adapting to highly specialized domains (medical imaging, scientific visualization) may need domain‑specific pre‑training.
  • Future Directions – The authors suggest exploring hybrid autoregressive‑diffusion schemes, richer conditioning (text, audio), and more aggressive token compression to push both quality and speed further.

Authors

  • Longbin Ji
  • Xiaoxiong Liu
  • Junyuan Shang
  • Shuohuan Wang
  • Yu Sun
  • Hua Wu
  • Haifeng Wang

Paper Information

  • arXiv ID: 2601.05966v1
  • Categories: cs.CV, cs.AI
  • Published: January 9, 2026