[Paper] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Published: December 17, 2025 at 01:53 PM EST
4 min read
Source: arXiv - 2512.15702v1

Overview

The paper presents Resampling Forcing, a training framework that lets autoregressive video diffusion models be trained end to end from scratch, without a separate teacher network or post‑hoc fine‑tuning. By “self‑resampling” the model’s own past predictions during training, the authors close the notorious train‑test gap (exposure bias) and enable scalable learning of temporally coherent video generators.

Key Contributions

  • Teacher‑free end‑to‑end training: Introduces a self‑resampling scheme that mimics inference‑time errors on history frames, eliminating the need for a bidirectional teacher or online discriminator.
  • Sparse causal masking: Enforces strict temporal causality while still allowing the diffusion loss to be computed in parallel across frames (a minimal mask sketch follows this list).
  • History routing: A parameter‑free, top‑k retrieval mechanism that dynamically selects the most relevant past frames for each generation step, boosting long‑horizon consistency.
  • Scalable native‑length training: Demonstrates that training on full‑length video sequences yields better temporal stability on long videos compared with distillation‑based baselines.
  • Empirical parity with state‑of‑the‑art: Achieves comparable quantitative performance (e.g., FVD, IS) to teacher‑based methods while improving qualitative temporal coherence.
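
To illustrate the masking idea referenced above, here is a minimal sketch of a frame‑level causal attention mask in PyTorch. The function name, token layout, and shapes are illustrative assumptions; the paper's actual mask may be sparser or structured differently.

```python
import torch

def sparse_causal_mask(num_frames: int, tokens_per_frame: int = 1) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Tokens of frame t may attend only to tokens of frames <= t, so no
    information flows from future frames into the past, yet the diffusion
    loss for every frame can still be computed in one parallel forward pass.
    """
    # Frame index of each token in the flattened (frames * tokens) sequence.
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # Entry [i, j] is True iff the frame of query token i is >= the frame of key token j.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Example: 4 frames, 2 tokens each -> an 8x8 block-lower-triangular mask.
print(sparse_causal_mask(4, 2))
```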

Methodology

  1. Self‑Resampling During Training

    • At each training iteration the model first re‑generates (resamples) the recent history frames using its current parameters, reproducing the kind of imperfect frames it would produce at inference time.
    • These “self‑sampled” frames replace the ground‑truth history, so the model learns to recover from its own mistakes—exactly what happens at test time.
  2. Sparse Causal Mask

    • A binary mask blocks information flow from future frames to past frames, preserving causality.
    • Because the mask is sparse, the diffusion loss can still be computed in parallel across all frames, keeping training efficient.
  3. Frame‑Level Diffusion Loss

    • The standard denoising diffusion objective is applied independently to each frame, conditioned on the (possibly degraded) history.
    • This keeps the loss simple and compatible with existing diffusion libraries.
  4. History Routing

    • For each target frame, the model scores all previous frames (e.g., via cosine similarity of latent embeddings).
    • It then selects the top‑k most relevant frames to condition on, discarding the rest.
    • This operation is deterministic and adds no learnable parameters, yet it dramatically reduces the memory footprint for long videos (a minimal routing sketch follows this list).
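
Because routing is described as a parameter‑free top‑k retrieval over latent similarities, it can be sketched in a few lines. The per‑frame embedding (mean‑pooled latents), tensor shapes, and function name below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def route_history(history_latents: torch.Tensor,
                  target_latent: torch.Tensor,
                  k: int) -> torch.Tensor:
    """Pick the k most relevant past frames for one target frame.

    history_latents: [T, C, H, W] latents of all previously generated frames.
    target_latent:   [C, H, W]    latent of the frame currently being denoised.
    Returns indices of the selected frames; the operation has no learnable
    parameters, so it adds nothing to the model's parameter count.
    """
    hist_emb = history_latents.mean(dim=(2, 3))   # [T, C] one embedding per frame
    tgt_emb = target_latent.mean(dim=(1, 2))      # [C]
    scores = F.cosine_similarity(hist_emb, tgt_emb.unsqueeze(0), dim=-1)  # [T]
    k = min(k, scores.numel())
    return scores.topk(k).indices                 # frames to condition on

# Usage: condition only on the routed frames, discarding the rest.
history = torch.randn(32, 4, 16, 16)  # 32 past frames in latent space
target = torch.randn(4, 16, 16)
selected = route_history(history, target, k=4)
context = history[selected]           # [4, 4, 16, 16] conditioning context
```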

Overall, the pipeline can be visualized as a loop: Generate → Replace History → Mask → Diffuse → Update, repeated until the full video is synthesized; a simplified training iteration is sketched below.
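
The sketch below strings these pieces together into one heavily simplified training iteration in PyTorch. The denoiser interface `model(frames, timesteps, mask)`, the few‑step resampler, and the toy noise schedule are all assumptions made for illustration; the paper's actual sampler, schedule, and conditioning interface may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def self_resample_history(model, history, mask, num_steps=4):
    """Re-generate history frames with the *current* model so the conditioning
    context carries the same kind of errors the model produces at inference.
    A crude few-step sampler stands in for whatever sampler the paper uses."""
    x = torch.randn_like(history)
    for step in reversed(range(1, num_steps + 1)):
        t = torch.full((x.shape[0],), step / num_steps, device=x.device)
        pred_noise = model(x, t, mask)
        x = x - pred_noise / num_steps  # toy update rule, not a real solver
    return x

def training_step(model, optimizer, video_latents, frame_mask, split):
    """One Generate -> Replace History -> Mask -> Diffuse -> Update iteration.

    video_latents: [F, C, H, W] full-length clip in latent space.
    frame_mask:    [F, F] frame-level causal mask (frame granularity for simplicity).
    split:         index separating history frames from target frames.
    """
    history, targets = video_latents[:split], video_latents[split:]

    # 1. Generate + Replace History: swap ground truth for self-sampled frames.
    resampled = self_resample_history(model, history, frame_mask[:split, :split])

    # 2. Diffuse: corrupt only the target frames with per-frame noise levels.
    t = torch.rand(targets.shape[0], device=targets.device)
    noise = torch.randn_like(targets)
    t_ = t.view(-1, 1, 1, 1)
    noisy_targets = (1 - t_) * targets + t_ * noise  # toy corruption schedule

    # 3. Mask + predict: the causal mask keeps future frames out of the past.
    model_input = torch.cat([resampled, noisy_targets], dim=0)
    t_all = torch.cat([torch.zeros(split, device=t.device), t], dim=0)
    pred_noise = model(model_input, t_all, frame_mask)

    # 4. Update: frame-level denoising loss on the target frames only.
    loss = F.mse_loss(pred_noise[split:], noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```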

Results & Findings

| Metric | Teacher‑Distilled Baseline | Resampling Forcing (Ours) |
| --- | --- | --- |
| FVD (lower is better) | 210 | 205 |
| IS (higher is better) | 12.4 | 12.6 |
| Temporal Consistency (TC) score | 0.78 | 0.84 |
| Training time (GPU‑hours) | 180 | 165 |

  • Quantitative parity: The new method matches or slightly outperforms the best distillation‑based approaches on standard video generation benchmarks (UCF‑101, Kinetics‑600).
  • Temporal consistency boost: Because the model sees full‑length sequences during training, it maintains smoother motion over longer horizons (e.g., 64‑frame clips) where baselines start to drift.
  • Efficiency: No extra teacher network or discriminator means fewer parameters and lower overall compute, while the sparse mask and history routing keep memory usage manageable for videos >30 seconds.

Qualitative samples show fewer flickering artifacts and more coherent object trajectories, especially in scenes with complex motion (e.g., sports, dancing).

Practical Implications

  • Simplified pipelines: Developers can now train autoregressive video diffusion models without orchestrating a separate teacher‑student distillation stage, reducing engineering overhead.
  • Scalable generation for content creation: The ability to train on native‑length videos makes it feasible to generate longer, high‑fidelity clips for games, VR, or synthetic data pipelines.
  • Real‑time or near‑real‑time inference: History routing limits the conditioning context to a handful of frames, enabling faster inference on edge devices or in cloud services where latency matters.
  • Better temporal consistency for downstream tasks: More stable video outputs improve downstream computer‑vision pipelines (e.g., action recognition, video‑to‑text) that rely on consistent motion cues.

In short, the framework lowers the barrier to adopt diffusion‑based video synthesis in production settings, from marketing video generation to simulation data for autonomous‑driving training.

Limitations & Future Work

  • Resolution ceiling: Experiments were limited to 64×64 or 128×128 frames; scaling to 4K video will require additional memory‑efficient tricks.
  • Fixed top‑k routing: While parameter‑free, a static k may be suboptimal for highly dynamic scenes; adaptive k or learned routing could further improve quality.
  • Exposure to extreme motion: The self‑resampling scheme assumes errors are modest; abrupt scene cuts or very fast motion may still cause drift.
  • Future directions suggested by the authors include integrating hierarchical diffusion (coarse‑to‑fine) to handle higher resolutions, and exploring learned attention‑based routing to replace the simple similarity metric.

Authors

  • Yuwei Guo
  • Ceyuan Yang
  • Hao He
  • Yang Zhao
  • Meng Wei
  • Zhenheng Yang
  • Weilin Huang
  • Dahua Lin

Paper Information

  • arXiv ID: 2512.15702v1
  • Categories: cs.CV
  • Published: December 17, 2025
