[Paper] WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
Source: arXiv - 2602.22142v1
Overview
The paper WeaveTime tackles a fundamental blind spot in today’s Video‑LLMs: they treat a video as a static collection of frames instead of a flowing, time‑ordered sequence. This “time‑agnostic” view makes it hard for models to reason about causality, follow events in the correct order, or keep the present frame distinct from past context—problems that become critical when processing live video streams. WeaveTime introduces a lightweight, model‑agnostic add‑on that teaches a Video‑LLM to perceive and use temporal order, all without redesigning the underlying architecture or requiring massive streaming datasets.
Key Contributions
- Temporal Reconstruction Objective – a simple “Streaming Order Perception” (SOP) loss that forces the model to reconstruct the correct chronological order of frames, injecting temporal awareness with only a few finetuning steps.
- Past‑Current Dynamic Focus Cache – an inference‑time mechanism that dynamically expands the history window only when the model’s uncertainty spikes, achieving a coarse‑to‑fine retrieval of past frames.
- Model‑agnostic Plug‑and‑Play Design – WeaveTime works with any off‑the‑shelf Video‑LLM (e.g., Flamingo‑Video, Video‑ChatGPT) without architectural changes, making it easy to adopt in existing pipelines.
- Efficiency Gains – By limiting history expansion to when it’s needed, the system reduces latency and GPU memory usage while still boosting streaming‑task accuracy.
- Empirical Validation – Consistent performance improvements across several streaming benchmarks (e.g., LiveQA, Streaming VQA) with lower inference time compared to baseline Video‑LLMs.
Methodology
Teach Order (Training Phase)
- The authors freeze the original Video‑LLM weights and add a lightweight temporal head.
- Using a Temporal Reconstruction loss, the model receives a shuffled mini‑batch of frames and must predict their original timestamps or reconstruct the correct order.
- This objective is applied on standard video datasets (no special streaming data needed), so the model learns order‑aware embeddings while preserving its visual‑language knowledge.
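The training step above can be sketched in a few lines. The following is a minimal, stdlib-only illustration of a shuffle-and-reconstruct objective, assuming a `head(frame) -> position logits` interface; `sop_loss` and `toy_head` are hypothetical names for illustration, not the paper's released code:

```python
import math
import random

random.seed(0)  # deterministic shuffles for the demo

def sop_loss(frame_embeds, head):
    """Streaming Order Perception sketch: shuffle the frames, then score how
    well `head` recovers each frame's original timestamp (cross-entropy).

    frame_embeds: list of T frame vectors (the frozen backbone's embeddings)
    head:         callable mapping one frame vector to T position logits
    """
    t = len(frame_embeds)
    perm = list(range(t))
    random.shuffle(perm)                      # perm[j] = original index in slot j
    total = 0.0
    for slot in range(t):
        orig = perm[slot]
        logits = head(frame_embeds[orig])     # scores over the T positions
        m = max(logits)                       # stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[orig] - log_z)      # cross-entropy with true position
    return total / t
```

A head that already encodes each frame's position drives this loss toward zero, which is exactly the order-awareness the finetuning is meant to instill.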
Use Order (Inference Phase)
- A Past‑Current Dynamic Focus Cache sits in front of the frozen Video‑LLM.
- For each incoming frame, the cache first runs a quick uncertainty estimator (e.g., the entropy of the language decoder's output distribution).
- If uncertainty is low, the model answers using only the current frame (fast path).
- If uncertainty exceeds a threshold, the cache pulls in a few strategically selected past frames (coarse‑to‑fine) and re‑runs the language generation, allowing the model to incorporate relevant history only when needed.
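The inference loop above can be sketched as follows. This is an illustrative stdlib-only sketch, assuming the model is a callable that returns `(answer, entropy)`; the class name, strides, and threshold values are assumptions, not the paper's released implementation:

```python
from collections import deque

class DynamicFocusCache:
    """Sketch of a past-current focus cache: answer from the current frame
    when the model is confident, and pull in past frames coarse-to-fine
    when predictive entropy crosses a threshold."""

    def __init__(self, model, entropy_threshold=2.0, max_history=32):
        self.model = model                    # callable: (frames, question) -> (answer, entropy)
        self.history = deque(maxlen=max_history)
        self.threshold = entropy_threshold

    def step(self, frame, question):
        answer, entropy = self.model(frames=[frame], question=question)
        if entropy <= self.threshold:
            self.history.append(frame)
            return answer                     # fast path: current frame only
        # Slow path: coarse-to-fine -- retry with progressively denser
        # subsamples of the history, most recent frames kept in order.
        for stride in (8, 4, 2, 1):
            context = list(self.history)[::-1][::stride][::-1] + [frame]
            answer, entropy = self.model(frames=context, question=question)
            if entropy <= self.threshold:
                break
        self.history.append(frame)
        return answer
```

Because confident frames never leave the fast path, the expensive multi-frame re-runs happen only on the uncertain minority of steps, which is where the paper's latency savings come from.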
The whole pipeline adds < 5 % extra parameters and can be dropped into any existing Video‑LLM deployment with a single line of code.
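As a concrete instance of the quick uncertainty estimator mentioned above, the trigger can be as simple as the Shannon entropy of the decoder's next-token distribution. The function below is an illustrative sketch, not the paper's code; the threshold a deployment compares against is a tuned hyperparameter:

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over `logits`.
    High entropy = the decoder is unsure, so the cache should pull history."""
    m = max(logits)                               # stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (confident) vs. a flat one (uncertain):
confident = token_entropy([8.0, 0.0, 0.0, 0.0])   # near 0 nats
uncertain = token_entropy([1.0, 1.0, 1.0, 1.0])   # log(4) ≈ 1.386 nats
```

Comparing the returned value against a threshold is all the fast-path check requires, so it adds negligible overhead per frame.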
Results & Findings
| Benchmark | Baseline Video‑LLM | + WeaveTime | Latency Δ |
|---|---|---|---|
| LiveQA (streaming VQA) | 62.4 % accuracy | 68.9 % | –12 % |
| Streaming VQA (temporal reasoning) | 58.1 % | 64.7 % | –9 % |
| Real‑time Captioning | 71.3 % BLEU‑4 | 75.5 % | –7 % |
- Accuracy boost: 5–7 % absolute gain on tasks that require temporal reasoning.
- Latency reduction: The dynamic cache cuts average inference time by ~10 % because many frames are answered in the fast path.
- Memory savings: Only a handful of past frames are kept in GPU memory at any moment, enabling deployment on edge GPUs (e.g., RTX 3060).
These results confirm that a modest amount of order‑aware finetuning plus a smart caching strategy can make a big difference for streaming video applications.
Practical Implications
- Live video assistants (e.g., real‑time sports commentary, surveillance monitoring) can now answer “what just happened?” questions without buffering the entire video feed.
- AR/VR pipelines that need low‑latency scene understanding can integrate WeaveTime to keep the current view sharp while still reasoning about recent actions.
- Edge deployment becomes feasible: developers can run a standard Video‑LLM on a modest GPU and add WeaveTime to meet strict latency budgets.
- Developer workflow is simplified—no need to collect massive streaming datasets or redesign model architectures; a few epochs of SOP finetuning and a plug‑in cache are enough.
- Open‑source release (code + weights) means teams can quickly benchmark and adapt the technique to domain‑specific video streams (e.g., medical endoscopy, industrial inspection).
Limitations & Future Work
- Temporal horizon: The cache currently expands only a few seconds into the past; very long‑range dependencies (e.g., minutes‑long narratives) may still be missed.
- Uncertainty heuristic: The trigger threshold is hand‑tuned; a more adaptive, learned policy could further reduce unnecessary history pulls.
- Evaluation scope: Benchmarks focus on English‑language tasks; multilingual or multimodal (audio‑visual) streaming scenarios remain unexplored.
- Future directions suggested by the authors include: hierarchical caching for multi‑scale temporal reasoning, integrating audio cues for richer context, and training the uncertainty estimator jointly with the SOP head for end‑to‑end optimization.
Authors
- Yulin Zhang
- Cheng Shi
- Sibei Yang
Paper Information
- arXiv ID: 2602.22142v1
- Categories: cs.CV
- Published: February 25, 2026