[Paper] Helios: Real Real-Time Long Video Generation Model

Published: March 4, 2026 at 01:45 PM EST

Source: arXiv - 2603.04379v1

Overview

Helios is a 14‑billion‑parameter autoregressive diffusion model that can generate minute‑long videos in real‑time (≈ 19.5 fps) on a single NVIDIA H100 GPU. By tackling three long‑standing bottlenecks—drift in long sequences, inference speed, and massive memory requirements—Helios brings high‑quality, long‑duration video synthesis into the reach of a single workstation.

Key Contributions

  • First real‑time, 14B video generator that runs at > 19 fps on one H100 while matching or surpassing the visual quality of existing strong baselines.
  • Drift‑robust training: novel loss‑aware data augmentation that simulates and corrects the “drifting” error that typically plagues long‑video autoregressive models, without resorting to heuristics like self‑forcing or key‑frame sampling.
  • No‑cache, no‑quantization speedup: achieves real‑time inference without the usual tricks (KV‑cache, sparse attention, quantization), thanks to aggressive compression of historical/noisy context and a reduced‑step sampler.
  • Memory‑efficient scaling: fits up to four 14B models in 80 GB GPU memory, enabling image‑diffusion‑scale batch sizes without needing model‑parallel or sharding frameworks.
  • Unified multimodal interface: a single model handles text‑to‑video (T2V), image‑to‑video (I2V), and video‑to‑video (V2V) generation with the same architecture and tokenization scheme.

Methodology

Helios builds on the autoregressive diffusion paradigm: each video frame is generated step‑by‑step, conditioned on previously generated frames and a noisy latent representation. The authors introduce three engineering pillars:
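The autoregressive loop described above can be sketched as follows. This is a toy illustration, not the paper's implementation: `denoise_step` is a hypothetical stand-in for the transformer denoiser, and the dimensions and step count are made up (the 8-12 step range matches the sampler described below).

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent, context, step):
    """Stand-in for the transformer denoiser: one diffusion update
    conditioned on the previously generated frames."""
    # Toy update: pull the noisy latent toward the context mean.
    return latent + 0.5 * (context.mean(axis=0) - latent)

def generate_video(num_frames=4, num_steps=8, latent_dim=16):
    frames = [rng.normal(size=latent_dim)]    # seed frame (e.g., from an I2V image)
    for _ in range(num_frames - 1):
        context = np.stack(frames)            # history of generated frames
        latent = rng.normal(size=latent_dim)  # fresh noise for the next frame
        for step in range(num_steps):         # few-step sampler (8-12 steps)
            latent = denoise_step(latent, context, step)
        frames.append(latent)
    return np.stack(frames)

video = generate_video()
print(video.shape)  # (4, 16)
```

The key structural point is the nesting: an outer frame-by-frame loop (autoregression) wrapping an inner denoising loop (diffusion), with each new frame conditioned on everything generated so far.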

  1. Drift‑aware training – During training, the model is deliberately fed “drifted” context (e.g., slightly misaligned or noisy past frames) and learns to recover, effectively teaching the network to self‑correct when errors accumulate over long horizons.
  2. Context compression – Historical frames and diffusion noise are aggressively down‑sampled and quantized into a compact latent form before being fed back into the transformer, slashing the per‑step compute cost.
  3. Reduced‑step sampling – Instead of the typical 50‑100 diffusion steps, Helios uses a learned scheduler that converges in as few as 8‑12 steps, bringing the total FLOPs per frame down to the level of a 1.3B video model.
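Pillar 1 (drift-aware training) can be illustrated with a minimal sketch. All names here are hypothetical: `drift_augment` simulates accumulated error by jittering and misaligning the context frames, and the "model" is a trivial placeholder, since the paper's actual loss-aware augmentation is not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def drift_augment(context, noise_std=0.1, shift_prob=0.5):
    """Simulate accumulated generation error: jitter the past frames and
    occasionally misalign them in time, so the model must learn to
    recover from a drifted history."""
    ctx = context + rng.normal(scale=noise_std, size=context.shape)
    if rng.random() < shift_prob:
        ctx = np.roll(ctx, shift=1, axis=0)  # temporal misalignment
    return ctx

def training_step(clean_context, target_frame):
    drifted = drift_augment(clean_context)
    # Train against the *clean* target despite the corrupted history,
    # i.e. teach the network to self-correct:
    prediction = drifted.mean(axis=0)        # stand-in for the model
    return float(np.mean((prediction - target_frame) ** 2))

ctx = rng.normal(size=(8, 16))   # 8 past frames, 16-dim latents
target = rng.normal(size=16)
loss = training_step(ctx, target)
```

The design choice worth noting: the corruption is applied to the *input context*, not the target, which is what turns ordinary teacher forcing into a recovery task.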

All of this runs on a single H100 thanks to custom kernel optimizations (e.g., fused attention‑norm layers, mixed‑precision kernels) and a memory‑packing scheme that stores four separate 14B model replicas in the same 80 GB memory pool for efficient pipeline parallelism.
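Pillar 2's context compression can be illustrated in miniature. This sketch assumes a simple stride-based temporal down-sampling plus uniform 8-bit quantization; the paper's actual compression scheme may differ, and the shapes are arbitrary.

```python
import numpy as np

def compress_context(frames, stride=2, levels=256):
    """Toy context compression: temporally down-sample the frame history,
    then quantize each latent to `levels` discrete values (uint8)."""
    kept = frames[::stride]                   # drop every other frame
    lo, hi = kept.min(), kept.max()
    q = np.round((kept - lo) / (hi - lo + 1e-8) * (levels - 1))
    return q.astype(np.uint8), (lo, hi)      # scale is kept for dequantization

rng = np.random.default_rng(0)
history = rng.normal(size=(32, 64))           # 32 past frames, 64-dim float64 latents
compact, scale = compress_context(history)
print(history.nbytes, compact.nbytes)  # 16384 1024
```

Even this naive version cuts the context from 16 KB to 1 KB; feeding the compact form back into the transformer is what shrinks the per-step attention cost.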

Results & Findings

| Metric | Helios (14B) | Prior SOTA (≈ 1.3B) | Speed |
|---|---|---|---|
| FVD (short video) | 68 | 85 | 19.5 fps |
| FVD (10-min video) | 112 | 158 | 19.5 fps |
| Memory (single GPU) | 78 GB | 30 GB | n/a |
| Training batch size | 64 (image-diffusion scale) | 16 | n/a |

  • Quality: Helios consistently beats smaller baselines on both short clips (≤ 5 s) and long sequences (up to 10 min) as measured by Fréchet Video Distance (FVD) and human preference studies.
  • Stability: The drift‑aware training eliminates the “blurry drift” and repetitive motion artifacts that typically appear after a few seconds of generation.
  • Efficiency: Despite being an order of magnitude larger, Helios’ per‑frame compute is comparable to, or lower than, that of 1.3B models because of the compressed context and fewer diffusion steps.

Practical Implications

  • Content creation pipelines: Studios and game developers can now generate high‑fidelity, minute‑long video assets on‑demand without a GPU farm, opening up rapid prototyping for cutscenes, background loops, or synthetic training data.
  • Interactive AI agents: Real‑time video synthesis enables avatars or virtual assistants that can produce dynamic visual responses on the fly, useful for AR/VR experiences.
  • Data augmentation for video ML: Researchers can generate large, diverse video datasets (e.g., for action recognition) without worrying about drift, improving downstream model robustness.
  • Edge‑to‑cloud workflows: Since Helios does not rely on KV‑caching or quantization, the same inference code can be deployed on any H100‑class hardware (including cloud instances) without model‑specific optimizations, simplifying integration.

Limitations & Future Work

  • Hardware dependency: Real‑time performance still hinges on a high‑end H100; scaling down to more common GPUs (A100, RTX 4090) will incur noticeable speed penalties.
  • Memory footprint: Although fitting four 14B replicas in 80 GB is impressive, the model remains memory‑heavy for multi‑GPU or mobile scenarios.
  • Generalization to exotic domains: The paper focuses on natural‑scene videos; performance on highly stylized or domain‑specific content (e.g., medical imaging, scientific visualizations) remains untested.
  • Future directions: The authors plan to explore (1) further compression (e.g., weight quantization) to broaden hardware support, (2) curriculum‑style training for even longer horizons (hour‑scale), and (3) open‑sourcing a distilled, smaller‑footprint variant for broader community adoption.

Authors

  • Shenghai Yuan
  • Yuanyang Yin
  • Zongjian Li
  • Xinwei Huang
  • Xiao Yang
  • Li Yuan

Paper Information

  • arXiv ID: 2603.04379v1
  • Categories: cs.CV
  • Published: March 4, 2026