[Paper] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Published: December 16, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.14699v1

Overview

The paper introduces MemFlow, a memory‑management system for streaming video generation that preserves long‑range narrative consistency without slowing down inference. By dynamically pulling the most relevant past frames based on the upcoming text prompt, MemFlow lets a video model keep its story on track while remaining nearly as fast as a memory‑free baseline.

Key Contributions

  • Adaptive memory retrieval: Before each video chunk is generated, MemFlow queries a memory bank with the chunk’s text prompt and fetches the most semantically relevant past frames (a minimal retrieval sketch follows this list).
  • Sparse attention activation: Only the retrieved tokens are attended to during generation, dramatically cutting the compute cost of long‑context attention.
  • Plug‑and‑play design: MemFlow works on top of any streaming video generator that already uses a KV‑cache (e.g., diffusion or autoregressive models).
  • Near‑zero overhead: Experiments show only a 7.9 % slowdown compared with a model that discards all past context, while delivering far superior consistency.
  • Extensive evaluation: The authors benchmark on multiple long‑video datasets, demonstrating both quantitative gains (higher CLIP‑Score, lower FVD) and qualitative improvements in narrative coherence.
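
To make the first two bullets concrete, here is a minimal retrieval sketch in PyTorch. The embedding dimension, the cosine‑similarity scoring, and the fixed `top_k` budget are illustrative assumptions; the paper’s actual retrieval interface may differ.

```python
import torch
import torch.nn.functional as F

def retrieve_active_memory(
    prompt_emb: torch.Tensor,   # (D,) embedding of the upcoming chunk's prompt
    memory_embs: torch.Tensor,  # (N, D) embeddings of the stored frames/chunks
    top_k: int = 4,
) -> torch.Tensor:
    """Return indices of the top-K stored entries most similar to the prompt."""
    scores = F.cosine_similarity(prompt_emb.unsqueeze(0), memory_embs, dim=-1)  # (N,)
    k = min(top_k, memory_embs.shape[0])
    return scores.topk(k).indices

# Toy usage: 100 stored frame embeddings, pick the 4 closest to the new prompt.
bank = torch.randn(100, 512)
prompt = torch.randn(512)
active_idx = retrieve_active_memory(prompt, bank)
print(active_idx)  # e.g. tensor([17, 42,  3, 88])
```

Note that MemFlow stores visual embeddings alongside the associated prompts (see Methodology below), so the real scoring space is richer than this single shared embedding; the sketch only shows the top‑K selection mechanics.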

Methodology

  1. Memory Bank Construction – As the model streams, each generated frame (or short chunk) is stored together with its visual embeddings and the associated textual prompt.
  2. Prompt‑guided Retrieval – When a new chunk is about to be synthesized, the current prompt is encoded and used to rank the stored embeddings (e.g., via cosine similarity). The top‑K most relevant frames are pulled into a temporary “active memory.”
  3. Sparse Cross‑Attention – In the attention layers of the video generator, queries from the current chunk attend only to tokens from the active memory instead of the full history. This reduces the quadratic cost of attention while preserving the most useful context.
  4. Integration with KV‑Cache – The retrieved tokens are injected into the existing key‑value cache, so the downstream model sees them as if they were part of its normal memory, requiring no architectural changes.

The pipeline repeats for every new chunk, constantly refreshing the active memory to reflect the evolving storyline. A simplified sketch of the memory‑injection step follows.
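
The attention side of steps 3 and 4 can be pictured with a short sketch: retrieved memory tokens are concatenated into the key‑value cache of an attention layer, so the current chunk attends over a small active memory rather than the full history. The tensor shapes, the fixed memory size, and the use of PyTorch’s `scaled_dot_product_attention` are assumptions made for illustration, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def attend_with_active_memory(
    q_chunk: torch.Tensor,  # (B, H, T_chunk, D) queries of the chunk being generated
    k_chunk: torch.Tensor,  # (B, H, T_chunk, D) keys of the current chunk
    v_chunk: torch.Tensor,  # (B, H, T_chunk, D) values of the current chunk
    k_mem: torch.Tensor,    # (B, H, T_mem, D) keys of the retrieved top-K frames
    v_mem: torch.Tensor,    # (B, H, T_mem, D) values of the retrieved top-K frames
) -> torch.Tensor:
    # Retrieved tokens are concatenated in front of the chunk's own KV entries,
    # so the attention kernel treats them as ordinary context (no model changes).
    k = torch.cat([k_mem, k_chunk], dim=2)
    v = torch.cat([v_mem, v_chunk], dim=2)
    # Attention now scales with T_chunk * (T_mem + T_chunk) rather than with
    # the full generated history.
    return F.scaled_dot_product_attention(q_chunk, k, v)

# Toy usage: a chunk of 16 frame tokens attending over 4 retrieved memory tokens.
B, H, D = 1, 8, 64
q = torch.randn(B, H, 16, D)
k_c, v_c = torch.randn(B, H, 16, D), torch.randn(B, H, 16, D)
k_m, v_m = torch.randn(B, H, 4, D), torch.randn(B, H, 4, D)
print(attend_with_active_memory(q, k_c, v_c, k_m, v_m).shape)  # torch.Size([1, 8, 16, 64])
```

The cost argument is visible in the shapes: the chunk’s 16 query tokens attend over 20 keys (4 memory frames plus the chunk itself), regardless of how long the video has already run.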

Results & Findings

| Metric | Baseline (no memory) | Fixed‑Strategy Memory | MemFlow |
|---|---|---|---|
| CLIP‑Score (higher = better) | 0.71 | 0.78 | 0.84 |
| FVD (lower = better) | 210 | 165 | 112 |
| Inference slowdown | 0% | +12% | +7.9% |
| Human consistency rating (1‑5) | 2.8 | 3.6 | 4.3 |

  • Narrative coherence improves dramatically, especially when the story introduces new events or switches scenes.
  • Computation grows only marginally because the attention is limited to a small, dynamically chosen subset of frames.
  • The method remains compatible with several backbone generators (e.g., Text‑to‑Video diffusion, autoregressive transformers), confirming its generality.

Practical Implications

  • Content creation platforms (e.g., AI‑driven video editors, game cutscene generators) can now produce hour‑long videos that stay on script without needing massive GPU memory.
  • Real‑time streaming services (live AI avatars, interactive storytelling) benefit from the low latency overhead, enabling smoother user experiences.
  • Developer workflow is simplified: MemFlow is a drop‑in module that wraps around existing models, so teams can upgrade consistency without retraining from scratch (a hypothetical wrapper sketch follows this list).
  • Edge deployment becomes feasible because the memory footprint stays bounded: only the K most relevant frames are kept active at any time.
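
As a rough illustration of the drop‑in claim above, the following hypothetical wrapper shows how a chunk‑wise generator and a text encoder could be composed with a retrieval bank. All names (`MemFlow`, `generate_chunk`, the `memory=` argument) are invented for this sketch; the paper’s real integration API is not reproduced here.

```python
import torch
import torch.nn.functional as F

class MemFlow:
    """Hypothetical wrapper: retrieve -> inject -> generate -> store, per chunk."""

    def __init__(self, base_generate_chunk, encode_prompt, top_k: int = 4):
        self.base_generate_chunk = base_generate_chunk  # existing chunk generator
        self.encode_prompt = encode_prompt              # existing text encoder
        self.top_k = top_k
        self.bank_keys = []    # retrieval key (embedding) per stored chunk
        self.bank_tokens = []  # stored frame/KV tokens per chunk

    def generate_chunk(self, prompt: str) -> torch.Tensor:
        query = self.encode_prompt(prompt)
        active = self._retrieve(query)                         # top-K past chunks
        chunk_tokens = self.base_generate_chunk(prompt, memory=active)
        self.bank_keys.append(query)                           # remember the new chunk
        self.bank_tokens.append(chunk_tokens)
        return chunk_tokens

    def _retrieve(self, query: torch.Tensor):
        if not self.bank_keys:
            return []
        keys = torch.stack(self.bank_keys)                     # (N, D)
        scores = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)
        idx = scores.topk(min(self.top_k, len(self.bank_keys))).indices
        return [self.bank_tokens[int(i)] for i in idx]

# Toy usage with stand-ins for a real generator and text encoder.
fake_encoder = lambda p: torch.randn(512)
fake_generator = lambda p, memory: torch.randn(16, 512)  # 16 frame tokens per chunk
gen = MemFlow(fake_generator, fake_encoder)
for prompt in ["a knight rides into the forest", "the knight reaches the castle"]:
    print(gen.generate_chunk(prompt).shape)              # torch.Size([16, 512])
```

The point of the sketch is the shape of the integration: the base generator and encoder are passed in unchanged, and all memory bookkeeping happens in the wrapper, which mirrors the plug‑and‑play claim above.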

Limitations & Future Work

  • Retrieval quality depends on the embedding space. If the visual encoder fails to capture subtle semantic nuances, the most “relevant” frames may be suboptimal.
  • Fixed K‑value: The current implementation uses a static number of retrieved frames; adaptive K based on prompt complexity could further improve efficiency.
  • Scalability of the full bank: While active memory is small, the underlying bank still grows linearly with video length; pruning strategies are needed for truly massive streams.
  • Broader modalities: Extending the approach to multimodal inputs (audio, motion capture) and to non‑textual prompts is an open direction.

Overall, MemFlow demonstrates that smart, prompt‑driven memory management can bridge the gap between long‑form narrative fidelity and real‑time performance—an encouraging step for the next generation of AI video generation tools.

Authors

  • Sihui Ji
  • Xi Chen
  • Shuai Yang
  • Xin Tao
  • Pengfei Wan
  • Hengshuang Zhao

Paper Information

  • arXiv ID: 2512.14699v1
  • Categories: cs.CV
  • Published: December 16, 2025