[Paper] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Published: December 16, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.14699v1

Overview

The paper introduces MemFlow, a memory‑management system for streaming video generation that preserves long‑range narrative consistency without slowing down inference. By dynamically pulling the most relevant past frames based on the upcoming text prompt, MemFlow lets a video model keep its story on track while remaining nearly as fast as a memory‑free baseline.

Key Contributions

  • Adaptive memory retrieval: Before each video chunk is generated, MemFlow queries a memory bank with the chunk’s text prompt and fetches the most semantically relevant past frames (a minimal retrieval sketch follows this list).
  • Sparse attention activation: Only the retrieved tokens are attended to during generation, dramatically cutting the compute cost of long‑context attention.
  • Plug‑and‑play design: MemFlow works on top of any streaming video generator that already uses a KV‑cache (e.g., diffusion or autoregressive models).
  • Near‑zero overhead: Experiments show only a 7.9 % slowdown compared with a model that discards all past context, while delivering far superior consistency.
  • Extensive evaluation: The authors benchmark on multiple long‑video datasets, demonstrating both quantitative gains (higher CLIP‑Score, lower FVD) and qualitative improvements in narrative coherence.
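
To make the first two bullets concrete, here is a minimal retrieval sketch in PyTorch. The embedding dimension, the cosine‑similarity scoring, and the fixed `top_k` budget are illustrative assumptions; the paper’s actual retrieval interface may differ.

```python
import torch
import torch.nn.functional as F

def retrieve_active_memory(
    prompt_emb: torch.Tensor,   # (D,) embedding of the upcoming chunk's prompt
    memory_embs: torch.Tensor,  # (N, D) embeddings of the stored frames/chunks
    top_k: int = 4,
) -> torch.Tensor:
    """Return indices of the top-K stored entries most similar to the prompt."""
    scores = F.cosine_similarity(prompt_emb.unsqueeze(0), memory_embs, dim=-1)  # (N,)
    k = min(top_k, memory_embs.shape[0])
    return scores.topk(k).indices

# Toy usage: 100 stored frame embeddings, pick the 4 closest to the new prompt.
bank = torch.randn(100, 512)
prompt = torch.randn(512)
active_idx = retrieve_active_memory(prompt, bank)
print(active_idx)  # e.g. tensor([17, 42,  3, 88])
```

Note that MemFlow stores visual embeddings alongside the associated prompts (see Methodology below), so the real scoring space is richer than this single shared embedding; the sketch only shows the top‑K selection mechanics.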

Methodology

  1. Memory Bank Construction – As the model streams, each generated frame (or short chunk) is stored together with its visual embeddings and the associated textual prompt.
  2. Prompt‑guided Retrieval – When a new chunk is about to be synthesized, the current prompt is encoded and used to rank the stored embeddings (e.g., via cosine similarity). The top‑K most relevant frames are pulled into a temporary “active memory.”
  3. Sparse Cross‑Attention – In the attention layers of the video generator, queries from the current chunk attend only to tokens from the active memory instead of the full history. This reduces the quadratic cost of attention while preserving the most useful context.
  4. Integration with KV‑Cache – The retrieved tokens are injected into the existing key‑value cache, so the downstream model sees them as if they were part of its normal memory, requiring no architectural changes.

The pipeline repeats for every new chunk, constantly refreshing the active memory to reflect the evolving storyline. A simplified sketch of the memory‑injection step follows.
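
The attention side of steps 3 and 4 can be pictured with a short sketch: retrieved memory tokens are concatenated into the key‑value cache of an attention layer, so the current chunk attends over a small active memory rather than the full history. The tensor shapes, the fixed memory size, and the use of PyTorch’s `scaled_dot_product_attention` are assumptions made for illustration, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def attend_with_active_memory(
    q_chunk: torch.Tensor,  # (B, H, T_chunk, D) queries of the chunk being generated
    k_chunk: torch.Tensor,  # (B, H, T_chunk, D) keys of the current chunk
    v_chunk: torch.Tensor,  # (B, H, T_chunk, D) values of the current chunk
    k_mem: torch.Tensor,    # (B, H, T_mem, D) keys of the retrieved top-K frames
    v_mem: torch.Tensor,    # (B, H, T_mem, D) values of the retrieved top-K frames
) -> torch.Tensor:
    # Retrieved tokens are concatenated in front of the chunk's own KV entries,
    # so the attention kernel treats them as ordinary context (no model changes).
    k = torch.cat([k_mem, k_chunk], dim=2)
    v = torch.cat([v_mem, v_chunk], dim=2)
    # Attention now scales with T_chunk * (T_mem + T_chunk) rather than with
    # the full generated history.
    return F.scaled_dot_product_attention(q_chunk, k, v)

# Toy usage: a chunk of 16 frame tokens attending over 4 retrieved memory tokens.
B, H, D = 1, 8, 64
q = torch.randn(B, H, 16, D)
k_c, v_c = torch.randn(B, H, 16, D), torch.randn(B, H, 16, D)
k_m, v_m = torch.randn(B, H, 4, D), torch.randn(B, H, 4, D)
print(attend_with_active_memory(q, k_c, v_c, k_m, v_m).shape)  # torch.Size([1, 8, 16, 64])
```

The cost argument is visible in the shapes: the chunk’s 16 query tokens attend over 20 keys (4 memory frames plus the chunk itself), regardless of how long the video has already run.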

Results & Findings

| Metric | Baseline (no memory) | Fixed‑Strategy Memory | MemFlow |
|---|---|---|---|
| CLIP‑Score (higher = better) | 0.71 | 0.78 | 0.84 |
| FVD (lower = better) | 210 | 165 | 112 |
| Inference slowdown | 0% | +12% | +7.9% |
| Human consistency rating (1‑5) | 2.8 | 3.6 | 4.3 |

  • Narrative coherence improves dramatically, especially when the story introduces new events or switches scenes.
  • Computation grows only marginally because the attention is limited to a small, dynamically chosen subset of frames.
  • The method remains compatible with several backbone generators (e.g., Text‑to‑Video diffusion, autoregressive transformers), confirming its generality.

Practical Implications

  • Content creation platforms (e.g., AI‑driven video editors, game cutscene generators) can now produce hour‑long videos that stay on script without needing massive GPU memory.
  • Real‑time streaming services (live AI avatars, interactive storytelling) benefit from the low latency overhead, enabling smoother user experiences.
  • Developer workflow is simplified: MemFlow is a drop‑in module that wraps around existing models, so teams can upgrade consistency without retraining from scratch (a hypothetical wrapper sketch follows this list).
  • Edge deployment becomes feasible because the memory footprint stays bounded: only the K most relevant frames are kept active at any time.
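
As a rough illustration of the drop‑in claim above, the following hypothetical wrapper shows how a chunk‑wise generator and a text encoder could be composed with a retrieval bank. All names (`MemFlow`, `generate_chunk`, the `memory=` argument) are invented for this sketch; the paper’s real integration API is not reproduced here.

```python
import torch
import torch.nn.functional as F

class MemFlow:
    """Hypothetical wrapper: retrieve -> inject -> generate -> store, per chunk."""

    def __init__(self, base_generate_chunk, encode_prompt, top_k: int = 4):
        self.base_generate_chunk = base_generate_chunk  # existing chunk generator
        self.encode_prompt = encode_prompt              # existing text encoder
        self.top_k = top_k
        self.bank_keys = []    # retrieval key (embedding) per stored chunk
        self.bank_tokens = []  # stored frame/KV tokens per chunk

    def generate_chunk(self, prompt: str) -> torch.Tensor:
        query = self.encode_prompt(prompt)
        active = self._retrieve(query)                         # top-K past chunks
        chunk_tokens = self.base_generate_chunk(prompt, memory=active)
        self.bank_keys.append(query)                           # remember the new chunk
        self.bank_tokens.append(chunk_tokens)
        return chunk_tokens

    def _retrieve(self, query: torch.Tensor):
        if not self.bank_keys:
            return []
        keys = torch.stack(self.bank_keys)                     # (N, D)
        scores = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)
        idx = scores.topk(min(self.top_k, len(self.bank_keys))).indices
        return [self.bank_tokens[int(i)] for i in idx]

# Toy usage with stand-ins for a real generator and text encoder.
fake_encoder = lambda p: torch.randn(512)
fake_generator = lambda p, memory: torch.randn(16, 512)  # 16 frame tokens per chunk
gen = MemFlow(fake_generator, fake_encoder)
for prompt in ["a knight rides into the forest", "the knight reaches the castle"]:
    print(gen.generate_chunk(prompt).shape)              # torch.Size([16, 512])
```

The point of the sketch is the shape of the integration: the base generator and encoder are passed in unchanged, and all memory bookkeeping happens in the wrapper, which mirrors the plug‑and‑play claim above.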

Limitations & Future Work

  • Retrieval quality depends on the embedding space. If the visual encoder fails to capture subtle semantic nuances, the most “relevant” frames may be suboptimal.
  • Fixed K‑value: The current implementation uses a static number of retrieved frames; adaptive K based on prompt complexity could further improve efficiency.
  • Scalability of the full bank: While active memory is small, the underlying bank still grows linearly with video length; pruning strategies are needed for truly massive streams.
  • Broader modalities: Extending the approach to multimodal inputs (audio, motion capture) and to non‑textual prompts is an open direction.

Overall, MemFlow demonstrates that smart, prompt‑driven memory management can bridge the gap between long‑form narrative fidelity and real‑time performance—an encouraging step for the next generation of AI video generation tools.

Authors

  • Sihui Ji
  • Xi Chen
  • Shuai Yang
  • Xin Tao
  • Pengfei Wan
  • Hengshuang Zhao

Paper Information

  • arXiv ID: 2512.14699v1
  • Categories: cs.CV
  • Published: December 16, 2025