[Paper] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Published: February 20, 2026 at 01:59 PM EST
5 min read
Source: arXiv


Overview

The paper introduces MemStream, a new approach for understanding continuous video streams—think live‑stream analytics or long‑form video question answering. By dramatically expanding the number of tokens a model can keep in its key‑value (KV) cache and adding smart token‑selection tricks, the authors show that you can retain far more fine‑grained visual detail without blowing up memory or latency.

Key Contributions

  • Token‑budget scaling: Demonstrates that increasing the KV‑cache token budget (from a few dozen to several hundred per frame) yields richer spatiotemporal representations.
  • Adaptive token reduction: Proposes a lightweight, data‑driven selector that prunes redundant tokens while preserving local context, keeping memory usage in check.
  • Training‑free retrieval Mixture‑of‑Experts (MoE): Leverages off‑the‑shelf vision‑language models as “experts” to score frame relevance, improving retrieval without extra training.
  • Bias mitigation: Identifies and corrects the tendency of existing caches to over‑favor recent frames, ensuring older but still relevant content isn’t ignored.
  • State‑of‑the‑art gains: Achieves +8.0 % on CG‑Bench, +8.5 % on LVBench, and +2.4 % on VideoMME (Long) compared with the strong ReKV baseline using the Qwen2.5‑VL‑7B model.

Methodology

  1. Baseline architecture (ReKV):

    • Frames are encoded into visual tokens.
    • Tokens are stored in a KV‑cache; a query (e.g., a VQA question) attends over the whole cache to retrieve relevant information.
  2. Problem discovered:

    • As the stream grows, similarity scores between the query and cached frames drift upward, causing the model to favor the newest frames and ignore earlier context.
  3. Adaptive token selection:

    • For each incoming frame, compute a redundancy score (e.g., cosine similarity among tokens).
    • Keep only the most informative tokens (those that add new spatial‑temporal cues) while discarding near‑duplicates.
    • This keeps the token count per frame high enough for detail but low enough to stay within GPU memory limits.
  4. Retrieval Mixture‑of‑Experts (MoE):

    • Instead of training a dedicated relevance scorer, the system queries several pre‑trained vision‑language models (e.g., CLIP, BLIP) in parallel.
    • Their similarity outputs are combined (weighted voting) to produce a robust relevance estimate for each cached frame.
    • No extra gradient updates are required—hence “training‑free.”
  5. Bias correction:

    • Apply a temporal decay factor to similarity scores, counteracting the natural increase over time.
    • The decay is learned implicitly through the MoE’s diverse perspectives, ensuring older frames can still surface when truly relevant.
  6. End‑to‑end pipeline:

    • Stream → Frame encoder → Adaptive token selector → KV‑cache (scaled) → MoE retrieval → Answer decoder (Qwen2.5‑VL‑7B).
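Step 3's adaptive token selection can be sketched as a greedy redundancy filter over a frame's visual tokens. Everything below is illustrative rather than the paper's actual implementation: the function name, the cosine-similarity redundancy score, and the 0.9 threshold are assumptions.

```python
import numpy as np

def select_informative_tokens(tokens: np.ndarray,
                              max_keep: int,
                              sim_threshold: float = 0.9) -> np.ndarray:
    """Greedily keep tokens whose cosine similarity to every
    already-kept token stays below `sim_threshold`, discarding
    near-duplicates (a stand-in for the paper's redundancy score)."""
    # L2-normalize rows so dot products are cosine similarities.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]  # always keep the first token
    for i in range(1, len(normed)):
        sims = normed[i] @ normed[kept].T
        if sims.max() < sim_threshold:  # novel enough -> keep it
            kept.append(i)
        if len(kept) == max_keep:
            break
    return tokens[kept]
```

A greedy filter like this is O(n·k) per frame, cheap enough to run on every incoming frame before tokens enter the KV-cache.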
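Step 4's training-free retrieval MoE amounts to score fusion across frozen experts. A minimal sketch, assuming each expert (e.g. CLIP, BLIP) has already produced a per-frame query-similarity vector; the min-max normalization and the fixed expert weights are assumptions, not the paper's exact voting rule.

```python
import numpy as np

def moe_relevance(expert_scores: dict[str, np.ndarray],
                  expert_weights: dict[str, float]) -> np.ndarray:
    """Combine per-frame query-similarity scores from several frozen
    vision-language 'experts' by weighted voting. No gradients are
    involved, hence 'training-free'."""
    total_w = sum(expert_weights.values())
    combined = None
    for name, scores in expert_scores.items():
        # Min-max normalize each expert so its scale is comparable.
        rng = scores.max() - scores.min()
        s = (scores - scores.min()) / (rng + 1e-8)
        w = expert_weights[name] / total_w
        combined = w * s if combined is None else combined + w * s
    return combined  # higher = more relevant cached frame
```

Because the experts are off-the-shelf encoders, swapping one out only changes the entries of `expert_scores`; nothing is retrained.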
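Step 5's bias correction counteracts the drift from step 2, in which newer frames accumulate inflated similarity scores. One simple way to realize it is an age-dependent multiplicative factor; the exponential form and the rate `lam` below are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def decay_corrected_scores(similarities: np.ndarray,
                           frame_ages: np.ndarray,
                           lam: float = 0.01) -> np.ndarray:
    """Boost older frames' similarity scores so the upward drift
    toward recent frames is counteracted and older-but-relevant
    content can still win retrieval. `frame_ages` counts how many
    frames ago each cached entry arrived (0 = newest)."""
    return similarities * np.exp(lam * frame_ages)
```

With `lam = 0`, this reduces to the uncorrected baseline, so the correction strength can be ablated directly.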

Results & Findings

| Benchmark | Baseline (ReKV + Qwen2.5-VL-7B) | MemStream |
| --- | --- | --- |
| CG-Bench | 62.1 % | 70.1 % (+8.0) |
| LVBench | 55.3 % | 63.8 % (+8.5) |
| VideoMME (Long) | 48.7 % | 51.1 % (+2.4) |
  • Token scaling alone gave a ~3–4 % boost, confirming that fine‑grained visual detail matters.
  • Adaptive selection clawed back most of the memory cost of token scaling (storing ≈ 30 % fewer tokens) while preserving the accuracy gains.
  • MoE retrieval contributed the remaining jump, especially on benchmarks with diverse visual domains.

Qualitative examples show MemStream correctly answering questions about objects that appear early in a 10‑minute stream—something the baseline missed because its cache had drifted toward later frames.

Practical Implications

  • Live video analytics: Companies building moderation or event‑detection pipelines can now keep richer context from hours‑long streams without needing massive GPU memory.
  • Interactive VQA bots: Chat‑style assistants that answer questions about a user’s ongoing video call or livestream can reference older moments more reliably.
  • Edge deployment: The adaptive token selector reduces the raw token count, making it feasible to run MemStream on consumer‑grade GPUs (e.g., RTX 3060) for real‑time applications.
  • Plug‑and‑play MoE: Since the retrieval experts are off‑the‑shelf models, teams can swap in domain‑specific encoders (e.g., medical imaging models) without retraining the whole system.

Overall, MemStream bridges the gap between high‑fidelity visual memory and practical compute budgets, opening the door for more sophisticated temporal reasoning in production AI products.

Limitations & Future Work

  • Memory ceiling: Even with adaptive pruning, extremely long streams (multi‑hour) still approach GPU memory limits; hierarchical caching (e.g., summarizing older segments) is left for future exploration.
  • Expert selection overhead: Querying multiple external models adds latency; smarter caching of expert scores or distilling the MoE into a single lightweight scorer could mitigate this.
  • Domain generalization: The experiments focus on general‑purpose benchmarks; performance on highly specialized video domains (e.g., autonomous driving) remains untested.
  • Temporal decay tuning: The decay factor is heuristic; learning a more principled, data‑driven decay could further reduce bias toward recent frames.

The authors suggest extending MemStream with learnable token‑selection policies and multi‑modal memory (audio + text) as promising next steps.

Authors

  • Vatsal Agarwal
  • Saksham Suri
  • Matthew Gwilliam
  • Pulkit Kumar
  • Abhinav Shrivastava

Paper Information

  • arXiv ID: 2602.18434v1
  • Categories: cs.CV
  • Published: February 20, 2026