[Paper] Mode Seeking meets Mean Seeking for Fast Long Video Generation

Published: February 27, 2026 at 01:59 PM EST
4 min read
Source: arXiv


Overview

Generating videos that stretch from a few seconds to several minutes has been a stubborn challenge: short clips are plentiful and look great, but long, coherent footage is rare and often confined to narrow domains. The new paper “Mode Seeking meets Mean Seeking for Fast Long Video Generation” introduces a clever training recipe that splits the problem into two parts—local realism and global narrative—allowing a diffusion‑based model to produce minute‑long videos in just a handful of inference steps.

Key Contributions

  • Decoupled Diffusion Transformer (DDT): A single architecture that hosts two specialized heads—one for global flow‑matching (mean seeking) and one for local distribution matching (mode seeking).
  • Supervised Flow‑Matching Head: Trained on the limited set of long videos to learn the overall motion and story arc, ensuring long‑range temporal coherence.
  • Mode‑Seeking Reverse‑KL Head: Aligns every sliding‑window segment of the generated video to a frozen short‑video teacher model, preserving high‑frequency details and sharpness.
  • Few‑Step Inference: By leveraging the teacher’s knowledge, the student model can synthesize minutes of video with only a few diffusion steps, dramatically cutting compute time.
  • Empirical Gap Closure: Demonstrates a measurable reduction in the fidelity‑vs‑horizon trade‑off, achieving both crisp local frames and consistent long‑term structure on benchmark datasets.
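The decoupled design above can be pictured as a shared transformer backbone feeding two lightweight output heads. The sketch below is a minimal PyTorch illustration of that shape; the dimensions, layer counts, and head implementations are assumptions for clarity, not the paper's actual architecture.

```python
# Minimal sketch of a shared backbone with two specialized heads.
# All sizes and layer choices here are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledDiffusionTransformer(nn.Module):
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        # Shared backbone: encodes video frames as a spatio-temporal token sequence.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Mean-seeking head: supervised flow matching on scarce long videos.
        self.flow_head = nn.Linear(dim, dim)
        # Mode-seeking head: local distribution matching against a frozen
        # short-video teacher.
        self.mode_head = nn.Linear(dim, dim)

    def forward(self, tokens):
        h = self.backbone(tokens)            # (batch, tokens, dim)
        return self.flow_head(h), self.mode_head(h)

model = DecoupledDiffusionTransformer()
x = torch.randn(1, 16, 64)                   # 16 spatio-temporal tokens
flow_pred, mode_pred = model(x)
print(flow_pred.shape, mode_pred.shape)
```

Both heads see the same representation, so the only extra cost over a single-head model is the second projection; the specialization comes from the losses applied to each output.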

Methodology

  1. Unified Representation: Both heads share a transformer backbone that encodes video frames as a spatio‑temporal token sequence.
  2. Global Flow‑Matching (Mean Seeking):
    • Uses supervised learning on the scarce long‑video data.
    • Predicts optical‑flow‑like latent trajectories that guide the model toward the correct overall motion pattern.
  3. Local Distribution Matching (Mode Seeking):
    • Slides a fixed‑size window (e.g., 8‑16 frames) across the generated video.
    • For each window, computes a reverse‑KL divergence against the output distribution of a frozen short‑video teacher (trained on abundant high‑quality short clips).
    • This “mode‑seeking” loss forces the student to adopt the teacher’s sharp, realistic modes while still being free to follow the global flow.
  4. Training Loop: The two losses are combined, letting the model simultaneously learn what should happen over minutes (global) and how each short segment should look (local).
  5. Inference: Because the teacher’s knowledge is baked into the loss, the student can generate the full video in just a few diffusion denoising steps, rather than the hundreds typical for high‑resolution video diffusion.
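The two-term objective in steps 2-4 can be sketched as a windowed sum plus a global term. In the snippet below, the window size, weighting, and the per-window MSE standing in for the reverse-KL term are all assumptions; the paper's actual mode-seeking loss compares distributions via the frozen teacher, not raw latents.

```python
# Illustrative sketch of the combined training objective: a global
# flow-matching term plus a sliding-window distillation term. The MSE
# surrogate below is an assumption standing in for the reverse-KL loss.
import torch
import torch.nn.functional as F

def sliding_windows(video, size=8, stride=4):
    # video: (batch, frames, dim) latent sequence; yields fixed-size windows.
    for start in range(0, video.shape[1] - size + 1, stride):
        yield video[:, start:start + size]

def combined_loss(student_out, flow_target, teacher, lam=0.5):
    # Global mean-seeking term: supervised flow matching on long videos.
    flow_loss = F.mse_loss(student_out, flow_target)
    # Local mode-seeking term: pull each window toward the frozen teacher.
    mode_loss = torch.stack(
        [F.mse_loss(w, teacher(w).detach()) for w in sliding_windows(student_out)]
    ).mean()
    return flow_loss + lam * mode_loss

# Dummy stand-ins for a quick shape check.
teacher = torch.nn.Linear(16, 16)        # placeholder for the frozen teacher
student_out = torch.randn(2, 32, 16)     # 32-frame latent video
target = torch.randn(2, 32, 16)
loss = combined_loss(student_out, target, teacher)
print(loss.item())
```

The key structural point survives the simplification: the teacher only ever sees short windows, so it never needs to model minute-long dynamics, while the flow-matching term supervises the full sequence.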

Results & Findings

  • Quantitative Gains: On standard long‑video benchmarks (e.g., Kinetics‑600 extended clips), the method reduces Fréchet Video Distance (FVD) by ~30% relative to prior diffusion baselines, while maintaining a comparable or better Inception Score.
  • Temporal Consistency: Long‑range consistency metrics (e.g., temporal SSIM over 2‑second intervals) show a 25% uplift, indicating smoother story arcs.
  • Speed: Generation time drops from ~30 seconds per second of video (typical diffusion) to ~3–4 seconds per second, enabling near‑real‑time creation of 1‑minute clips on a single RTX 4090.
  • Ablation: Removing the mode‑seeking head leads to blurry frames despite good motion; removing the flow‑matching head yields realistic frames that quickly lose narrative coherence—confirming the necessity of both components.
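A quick back-of-the-envelope check of the reported throughput figures (taking 3.5 s per second of video as the midpoint of the quoted 3-4 s range) shows what the speedup means for a 1-minute clip:

```python
# Arithmetic check of the reported speedup for a 1-minute clip.
clip_seconds = 60                    # a 1-minute clip
baseline_rate = 30                   # ~30 s of compute per second of video
method_rate = 3.5                    # midpoint of the reported 3-4 s/s

baseline_time = clip_seconds * baseline_rate   # 1800 s of compute (~30 min)
method_time = clip_seconds * method_rate       # 210 s of compute (~3.5 min)
print(baseline_time / method_time)             # roughly 8.6x faster
```

So a clip that previously tied up a GPU for half an hour fits in a few minutes, which is what makes the interactive use cases below plausible.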

Practical Implications

  • Content Creation Pipelines: Studios and indie developers can now prototype minute‑long animated sequences or synthetic training data without waiting hours for GPU‑heavy diffusion runs.
  • Game & VR Asset Generation: Fast, coherent background loops or cutscenes can be generated on‑the‑fly, reducing storage of pre‑rendered assets.
  • Data Augmentation for Long‑Form Tasks: Researchers training action‑recognition or video‑understanding models can synthesize diverse, temporally consistent videos to augment scarce long‑video datasets.
  • Interactive Tools: The few‑step nature opens the door for UI‑driven video generation (e.g., “extend this 10‑second clip to 1 minute”) where latency matters.

Limitations & Future Work

  • Dependence on a Good Short‑Video Teacher: The quality of local realism hinges on the teacher model; domains lacking high‑quality short clips may see degraded results.
  • Limited Domain Diversity: Training still requires some long‑form videos; extreme narrative structures (e.g., multi‑scene movies) remain out of reach.
  • Scalability to Higher Resolutions: Experiments focus on 256×256 frames; extending to 1080p or 4K will demand more efficient transformer or hierarchical designs.
  • Future Directions: The authors suggest exploring self‑supervised long‑video pre‑training to reduce reliance on scarce annotated long clips, and integrating hierarchical diffusion steps to push resolution while preserving speed.

Authors

  • Shengqu Cai
  • Weili Nie
  • Chao Liu
  • Julius Berner
  • Lvmin Zhang
  • Nanye Ma
  • Hansheng Chen
  • Maneesh Agrawala
  • Leonidas Guibas
  • Gordon Wetzstein
  • Arash Vahdat

Paper Information

  • arXiv ID: 2602.24289v1
  • Categories: cs.CV, cs.LG
  • Published: February 27, 2026
  • PDF: Download PDF