[Paper] AdaState: Self-Evolving Anchors for Streaming Video Generation

Published: May 28, 2026 at 01:59 PM EDT

Source: arXiv - 2605.30349v1

Overview

The paper AdaState: Self‑Evolving Anchors for Streaming Video Generation addresses a subtle but pervasive problem in autoregressive video diffusion models: the first frame is treated as a permanent “anchor” that dominates the attention cache, which freezes the scene and mutes motion. The authors replace this static anchor with a learnable hidden state that evolves together with the generated frames. The result is a more dynamic, camera‑aware video generation pipeline that still works in a streaming (chunk‑by‑chunk) fashion.

Key Contributions

Adaptive hidden state (“AdaState”) that is denoised alongside each video chunk, serving as a moving reference instead of a frozen first‑frame anchor.
Relative‑time formulation: the same positional structure is reused at every generation step, making the transition function time‑invariant and introducing a natural recurrence.
Cache‑only recurrence: the KV (key‑value) cache of the transformer doubles as the carrier of the evolving state, eliminating the need for any external recurrent module.
Empirical validation showing markedly richer motion, camera pans, and scene evolution on standard video diffusion benchmarks.
Conceptual bridge between diffusion‑based video generation and classic recurrent models, opening a new design space for streaming generative systems.
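
The relative-time idea in the list above can be illustrated with a small sketch. It is not the authors' code: the function names, the `state_len` parameter, and the layout (state slots first, then the current chunk) are assumptions made only for illustration. It contrasts position indices that grow with the number of generated chunks against indices that are reused at every step.

```python
def absolute_positions(chunk_index, chunk_len, state_len=1):
    """Absolute-time layout (hypothetical): the anchor keeps position 0 and
    chunk positions grow with the number of chunks already generated."""
    start = chunk_index * chunk_len
    anchor = list(range(state_len))
    chunk = [state_len + start + i for i in range(chunk_len)]
    return anchor + chunk


def relative_positions(chunk_index, chunk_len, state_len=1):
    """Relative-time layout (hypothetical): state slots come first, then the
    current chunk, and the pattern does not depend on chunk_index."""
    del chunk_index  # deliberately unused: the layout is step-invariant
    return list(range(state_len + chunk_len))


if __name__ == "__main__":
    for k in (0, 2, 7):
        print(k, absolute_positions(k, 4), relative_positions(k, 4))
```

Because the relative layout is identical at every step, the same transition can be applied repeatedly, which is what gives the formulation its recurrent character.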

Methodology

Baseline architecture – The authors start from a typical autoregressive video diffusion model that processes video in fixed‑size chunks. The first frame’s latent is stored in the transformer’s KV cache and is repeatedly attended to, acting as a static scene anchor.
Introducing AdaState – Instead of a fixed first‑frame latent, a learnable hidden vector (the “state”) is inserted into the cache at every step. This state is not rendered; it is only used as a contextual cue for the next chunk.
Joint denoising – During each diffusion step, the model simultaneously denoises the visible content (the current chunk) and the hidden state. The state update function is the same diffusion transition used for the visible frames, ensuring a unified training objective.
Relative positional encoding – Positional embeddings are defined relative to the current chunk rather than absolute time indices. Consequently, the model sees the same positional pattern regardless of how many chunks have already been generated, making the state transition time‑invariant.
Training & inference – The model is trained end‑to‑end with the usual diffusion loss, but the loss is also back‑propagated through the hidden state updates. At inference time, the hidden state is simply carried forward in the KV cache, requiring no extra memory or compute beyond the existing transformer cache.
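
The steps above can be sketched as a toy streaming loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_denoise_step` is a hypothetical stand-in for the transformer denoiser (it only pulls tokens toward the mean of the cached context), the cache is a plain dictionary rather than real key/value tensors, and it is assumed that both the chunk and the next state start from noise and that the cache is overwritten at each step so its size stays constant.

```python
import random


def toy_denoise_step(tokens, context, rate=0.5):
    """Hypothetical placeholder for one diffusion step: move every token
    part of the way toward the mean of the context vectors."""
    dim = len(context[0])
    mean = [sum(v[d] for v in context) / len(context) for d in range(dim)]
    return [[t[d] + rate * (mean[d] - t[d]) for d in range(dim)] for t in tokens]


def generate_stream(num_chunks, chunk_len=2, dim=3, num_steps=4, seed=0):
    """Generate chunks one at a time while carrying a hidden state in the cache."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Assumption: the cache starts with an initial state and no frames.
    cache = {"state": [noise()], "frames": []}
    video, states = [], []
    for _ in range(num_chunks):
        chunk = [noise() for _ in range(chunk_len)]  # visible frames, from noise
        state = [noise()]                            # hidden state, from noise
        context = cache["state"] + cache["frames"]   # what the step attends to
        for _ in range(num_steps):
            # Joint denoising: state and chunk pass through the same update.
            joint = toy_denoise_step(state + chunk, context)
            state, chunk = joint[:1], joint[1:]
        # Cache-only recurrence: the new state replaces the old one.
        cache = {"state": state, "frames": chunk}
        video.extend(chunk)       # only the chunk is emitted as output
        states.append(state[0])   # the state is never rendered
    return video, states, cache


if __name__ == "__main__":
    video, states, cache = generate_stream(num_chunks=3)
    print(len(video), "frames;", len(cache["state"]) + len(cache["frames"]), "cache slots")
```

In this sketch the only thing carried between chunks is the cache itself, which mirrors the claim that no separate recurrent module is needed. How the real model initializes and updates the state is not specified in this summary.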

Results & Findings

Motion quality – Quantitative metrics (e.g., FVD, LPIPS over time) improve by 15‑25% over the static‑anchor baseline, indicating more realistic motion trajectories.
Camera dynamics – Visual inspection shows smoother pans and zooms; the model no longer “locks” to the initial viewpoint.
Temporal consistency – Despite the evolving anchor, frame‑to‑frame coherence remains high, demonstrating that the adaptive state does not sacrifice stability.
Ablation studies – Removing the relative‑time encoding or training the state without joint denoising degrades performance back to baseline levels, confirming the importance of both components.

Practical Implications

Streaming content creation – Developers building real‑time video generation tools (e.g., virtual avatars, live‑stream overlays) can now generate longer, more dynamic clips without needing to pre‑compute the entire sequence.
Game engines & VR – The recurrence‑based approach fits naturally into existing transformer‑cache pipelines used for procedural content generation, enabling on‑the‑fly scene evolution that respects camera motion.
Low‑latency pipelines – Because AdaState lives inside the KV cache, there’s no extra model or memory overhead, making it suitable for edge devices or cloud‑GPU inference where latency is critical.
Hybrid generative systems – The relative‑time, state‑transition view opens the door to mixing diffusion with classic RNN‑style controllers (e.g., for user‑guided camera paths) without architectural gymnastics.

Limitations & Future Work

Hidden state interpretability – The adaptive state is a black‑box latent; understanding what aspects of the scene it encodes remains an open question.
Scalability to very long videos – While the cache recurrence is efficient, extremely long sequences may still hit memory limits; hierarchical caching strategies could be explored.
Generalization to multimodal conditioning – The current work focuses on unconditional generation; extending AdaState to text‑to‑video or audio‑driven scenarios is a natural next step.
Benchmark diversity – Experiments were conducted on a handful of standard diffusion datasets; testing on higher‑resolution, domain‑specific video (e.g., medical imaging, autonomous‑driving footage) would further validate robustness.

AdaState demonstrates that a modest change, making the scene anchor a learnable, evolving hidden state, can substantially increase the dynamism of streaming video diffusion models, offering developers a practical path toward richer, more interactive generative video applications.

Authors

Yusuf Dalva
Pinar Yanardag

Paper Information

arXiv ID: 2605.30349v1
Categories: cs.CV
Published: May 28, 2026
