[Paper] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Published: February 20, 2026 at 01:59 PM EST
5 min read
Source: arXiv


Overview

The paper introduces MemStream, a new approach for understanding continuous video streams—think live‑stream analytics or long‑form video question answering. By dramatically expanding the number of tokens a model can keep in its key‑value (KV) cache and adding smart token‑selection tricks, the authors show that you can retain far more fine‑grained visual detail without blowing up memory or latency.

Key Contributions

  • Token‑budget scaling: Demonstrates that increasing the KV‑cache token budget (from a few dozen to several hundred per frame) yields richer spatiotemporal representations.
  • Adaptive token reduction: Proposes a lightweight, data‑driven selector that prunes redundant tokens while preserving local context, keeping memory usage in check.
  • Training‑free retrieval Mixture‑of‑Experts (MoE): Leverages off‑the‑shelf vision‑language models as “experts” to score frame relevance, improving retrieval without extra training.
  • Bias mitigation: Identifies and corrects the tendency of existing caches to over‑favor recent frames, ensuring older but still relevant content isn’t ignored.
  • State‑of‑the‑art gains: Achieves +8.0 % on CG‑Bench, +8.5 % on LVBench, and +2.4 % on VideoMME (Long) compared with the strong ReKV baseline using the Qwen2.5‑VL‑7B model.

Methodology

  1. Baseline architecture (ReKV):

    • Frames are encoded into visual tokens.
    • Tokens are stored in a KV‑cache; a query (e.g., a VQA question) attends over the whole cache to retrieve relevant information.
  2. Problem discovered:

    • As the stream grows, similarity scores between the query and cached frames drift upward, causing the model to favor the newest frames and ignore earlier context.
  3. Adaptive token selection:

    • For each incoming frame, compute a redundancy score (e.g., cosine similarity among tokens).
    • Keep only the most informative tokens (those that add new spatial‑temporal cues) while discarding near‑duplicates.
    • This keeps the token count per frame high enough for detail but low enough to stay within GPU memory limits.
  4. Retrieval Mixture‑of‑Experts (MoE):

    • Instead of training a dedicated relevance scorer, the system queries several pre‑trained vision‑language models (e.g., CLIP, BLIP) in parallel.
    • Their similarity outputs are combined (weighted voting) to produce a robust relevance estimate for each cached frame.
    • No extra gradient updates are required—hence “training‑free.”
  5. Bias correction:

    • Apply a temporal decay factor to similarity scores, counteracting the natural increase over time.
    • The decay is learned implicitly through the MoE’s diverse perspectives, ensuring older frames can still surface when truly relevant.
  6. End‑to‑end pipeline:

    • Stream → Frame encoder → Adaptive token selector → KV‑cache (scaled) → MoE retrieval → Answer decoder (Qwen2.5‑VL‑7B).
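Step 3's adaptive token selection can be sketched as a greedy redundancy filter over a frame's visual tokens. Everything below is illustrative rather than the paper's actual implementation: the function name, the cosine-similarity redundancy score, and the 0.9 threshold are assumptions.

```python
import numpy as np

def select_informative_tokens(tokens: np.ndarray,
                              max_keep: int,
                              sim_threshold: float = 0.9) -> np.ndarray:
    """Greedily keep tokens whose cosine similarity to every
    already-kept token stays below `sim_threshold`, discarding
    near-duplicates (a stand-in for the paper's redundancy score)."""
    # L2-normalize rows so dot products are cosine similarities.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]  # always keep the first token
    for i in range(1, len(normed)):
        sims = normed[i] @ normed[kept].T
        if sims.max() < sim_threshold:  # novel enough -> keep it
            kept.append(i)
        if len(kept) == max_keep:
            break
    return tokens[kept]
```

A greedy filter like this is O(n·k) per frame, cheap enough to run on every incoming frame before tokens enter the KV-cache.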
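Step 4's training-free retrieval MoE amounts to score fusion across frozen experts. A minimal sketch, assuming each expert (e.g. CLIP, BLIP) has already produced a per-frame query-similarity vector; the min-max normalization and the fixed expert weights are assumptions, not the paper's exact voting rule.

```python
import numpy as np

def moe_relevance(expert_scores: dict[str, np.ndarray],
                  expert_weights: dict[str, float]) -> np.ndarray:
    """Combine per-frame query-similarity scores from several frozen
    vision-language 'experts' by weighted voting. No gradients are
    involved, hence 'training-free'."""
    total_w = sum(expert_weights.values())
    combined = None
    for name, scores in expert_scores.items():
        # Min-max normalize each expert so its scale is comparable.
        rng = scores.max() - scores.min()
        s = (scores - scores.min()) / (rng + 1e-8)
        w = expert_weights[name] / total_w
        combined = w * s if combined is None else combined + w * s
    return combined  # higher = more relevant cached frame
```

Because the experts are off-the-shelf encoders, swapping one out only changes the entries of `expert_scores`; nothing is retrained.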
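Step 5's bias correction counteracts the drift from step 2, in which newer frames accumulate inflated similarity scores. One simple way to realize it is an age-dependent multiplicative factor; the exponential form and the rate `lam` below are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def decay_corrected_scores(similarities: np.ndarray,
                           frame_ages: np.ndarray,
                           lam: float = 0.01) -> np.ndarray:
    """Boost older frames' similarity scores so the upward drift
    toward recent frames is counteracted and older-but-relevant
    content can still win retrieval. `frame_ages` counts how many
    frames ago each cached entry arrived (0 = newest)."""
    return similarities * np.exp(lam * frame_ages)
```

With `lam = 0`, this reduces to the uncorrected baseline, so the correction strength can be ablated directly.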

Results & Findings

| Benchmark | Baseline (ReKV + Qwen2.5-VL-7B) | MemStream |
| --- | --- | --- |
| CG-Bench | 62.1 % | 70.1 % (+8.0) |
| LVBench | 55.3 % | 63.8 % (+8.5) |
| VideoMME (Long) | 48.7 % | 51.1 % (+2.4) |
  • Token scaling alone gave a ~3–4 % boost, confirming that fine‑grained visual detail matters.
  • Adaptive selection clawed back most of the memory cost of token scaling (storing ≈ 30 % fewer tokens) while preserving the accuracy gains.
  • MoE retrieval contributed the remaining jump, especially on benchmarks with diverse visual domains.

Qualitative examples show MemStream correctly answering questions about objects that appear early in a 10‑minute stream—something the baseline missed because its cache had drifted toward later frames.

Practical Implications

  • Live video analytics: Companies building moderation or event‑detection pipelines can now keep richer context from hours‑long streams without needing massive GPU memory.
  • Interactive VQA bots: Chat‑style assistants that answer questions about a user’s ongoing video call or livestream can reference older moments more reliably.
  • Edge deployment: The adaptive token selector reduces the raw token count, making it feasible to run MemStream on consumer‑grade GPUs (e.g., RTX 3060) for real‑time applications.
  • Plug‑and‑play MoE: Since the retrieval experts are off‑the‑shelf models, teams can swap in domain‑specific encoders (e.g., medical imaging models) without retraining the whole system.

Overall, MemStream bridges the gap between high‑fidelity visual memory and practical compute budgets, opening the door for more sophisticated temporal reasoning in production AI products.

Limitations & Future Work

  • Memory ceiling: Even with adaptive pruning, extremely long streams (multi‑hour) still approach GPU memory limits; hierarchical caching (e.g., summarizing older segments) is left for future exploration.
  • Expert selection overhead: Querying multiple external models adds latency; smarter caching of expert scores or distilling the MoE into a single lightweight scorer could mitigate this.
  • Domain generalization: The experiments focus on general‑purpose benchmarks; performance on highly specialized video domains (e.g., autonomous driving) remains untested.
  • Temporal decay tuning: The decay factor is heuristic; learning a more principled, data‑driven decay could further reduce bias toward recent frames.

The authors suggest extending MemStream with learnable token‑selection policies and multi‑modal memory (audio + text) as promising next steps.

Authors

  • Vatsal Agarwal
  • Saksham Suri
  • Matthew Gwilliam
  • Pulkit Kumar
  • Abhinav Shrivastava

Paper Information

  • arXiv ID: 2602.18434v1
  • Categories: cs.CV
  • Published: February 20, 2026