[Paper] Memory Caching: RNNs with Growing Memory
Source: arXiv - 2602.24281v1
Overview
The paper “Memory Caching: RNNs with Growing Memory” proposes a lightweight add‑on that lets recurrent neural networks (RNNs) expand their effective memory as a sequence gets longer. By checkpoint‑caching hidden states, the authors bridge the gap between the linear‑time, fixed‑size memory of classic RNNs and the quadratic‑time, ever‑growing memory of Transformers—offering a tunable trade‑off that can be deployed on today’s hardware.
Key Contributions
- Memory Caching (MC) technique: a simple mechanism to store and reuse past hidden‑state checkpoints, effectively enlarging an RNN’s memory capacity without changing its core recurrence.
- Four MC variants:
- Plain caching – naïve storage of every hidden state.
- Gated aggregation – learns a weighted blend of cached states.
- Sparse selective caching – keeps only a subset of checkpoints based on a learned importance score.
- Hybrid deep‑memory caching – integrates MC with multi‑layer (deep) memory modules.
- Complexity interpolation: MC can be configured to run anywhere from \(O(L)\) (RNN‑like) to \(O(L^2)\) (Transformer‑like) time, letting practitioners pick the sweet spot for latency vs. accuracy.
- Empirical validation: Demonstrates consistent gains on language modeling benchmarks (e.g., WikiText‑103) and long‑context reasoning tasks, narrowing the performance gap to Transformers while staying cheaper.
- Open‑source implementation: The authors release code and pretrained checkpoints, making it easy for developers to plug MC into existing RNN pipelines.
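To make the core idea concrete, here is a minimal toy sketch of the plain-caching variant in numpy. Everything here is an illustrative assumption, not the authors' released code: the step function is a vanilla tanh RNN standing in for an LSTM/GRU, checkpoints are taken at a fixed interval, and the cache is merged by simple averaging and addition rather than a learned read-out.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One vanilla tanh RNN step (a stand-in for an LSTM/GRU cell)."""
    return np.tanh(h @ W_h + x @ W_x)

def run_with_plain_cache(xs, d, interval=4, seed=0):
    """Plain Memory Caching (toy version): checkpoint the hidden state
    every `interval` steps, then average the cache into the final state.
    The paper's variants use learned gating/selection instead."""
    rng = np.random.default_rng(seed)
    W_h = rng.standard_normal((d, d)) * 0.1
    W_x = rng.standard_normal((d, d)) * 0.1
    h = np.zeros(d)
    cache = []
    for t, x in enumerate(xs):
        h = rnn_step(h, x, W_h, W_x)
        if (t + 1) % interval == 0:       # fixed-interval checkpointing
            cache.append(h.copy())
    memory = np.mean(cache, axis=0) if cache else np.zeros(d)
    return h + memory                     # merge memory via addition
```

Note that the recurrence itself is untouched; the cache sits beside it, which is what makes MC an add-on rather than a new architecture.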
Methodology
- Baseline RNN – The authors start with a standard recurrent architecture (e.g., LSTM or GRU) that processes a token sequence \(\{x_t\}_{t=1}^{L}\) and produces hidden states \(h_t\).
- Checkpointing – At configurable intervals (or when a learned “importance” signal spikes), the current hidden state is saved to a cache \(C = \{c_1, \dots, c_K\}\).
- Memory read‑out – When the RNN needs to produce an output at step \(t\), it queries the cache.
- Plain MC simply concatenates or averages all cached states.
- Gated aggregation learns a gate \(g_k = \sigma(W_g c_k + b_g)\) and computes \(\tilde{h}_t = \sum_k g_k c_k\).
- Sparse selective MC applies top‑k selection to a scoring function \(s_k = f(c_k)\), keeping only the most relevant checkpoints.
- Integration – The retrieved memory \(\tilde{h}_t\) is merged with the current hidden state (e.g., via addition or a small feed‑forward network) before the final output layer.
- Training – The whole system remains end‑to‑end differentiable; the cache operations are implemented with efficient tensor indexing, so training overhead stays modest.
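The two learned read-outs above can be sketched in a few lines of numpy. This is a hedged illustration under simplifying assumptions: the gate is a scalar per checkpoint (the paper's \(W_g\) may produce vector gates), the scorer \(f\) is a plain linear map, and the selected checkpoints are averaged rather than combined by a learned module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_readout(cache, W_g, b_g):
    """Gated aggregation: g_k = sigmoid(W_g c_k + b_g), one scalar gate
    per checkpoint here, then memory = sum_k g_k * c_k."""
    gates = sigmoid(cache @ W_g + b_g)        # shape (K,)
    return (gates[:, None] * cache).sum(axis=0)

def sparse_readout(cache, score_w, k=2):
    """Sparse selective MC: score checkpoints with s_k = f(c_k)
    (a linear scorer here) and keep only the top-k before averaging."""
    scores = cache @ score_w                  # shape (K,)
    top = np.argsort(scores)[-k:]             # indices of top-k checkpoints
    return cache[top].mean(axis=0)
```

Both functions are differentiable almost everywhere (top-k selection aside), consistent with the paper's claim that the system trains end-to-end; in practice the scorer and gate would be trained jointly with the RNN.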
Results & Findings
| Task | Model | Perplexity / Accuracy | Relative Cost |
|---|---|---|---|
| WikiText‑103 (LM) | LSTM (baseline) | 34.2 | 1× |
| WikiText‑103 (LM) | LSTM + Plain MC (full cache) | 30.8 | 1.3× |
| WikiText‑103 (LM) | LSTM + Gated MC | 30.5 | 1.4× |
| WikiText‑103 (LM) | LSTM + Sparse MC (top‑10%) | 31.2 | 1.2× |
| Long‑Context QA | Deep RNN | 68.4% F1 | 1× |
| Long‑Context QA | Deep RNN + Hybrid MC | 71.9% F1 | 1.5× |
| In‑Context Recall | Transformer (baseline) | 92.1% | 1× |
| In‑Context Recall | RNN + Gated MC | 89.4% | 0.6× |
- Performance boost: All MC variants improve perplexity and downstream task scores, with gated aggregation giving the strongest lift.
- Efficiency: Even the full‑cache version stays well below the quadratic cost of a Transformer, and the sparse version can be tuned to run almost as fast as a vanilla RNN.
- Memory‑accuracy trade‑off: By adjusting cache size or sparsity, developers can dial in the desired balance—e.g., a 10% cache yields ~90% of the full‑cache gain at <20% extra compute.
Practical Implications
- Deployable on edge / low‑power devices: MC lets you keep the lightweight recurrence of RNNs while handling longer contexts (e.g., chat histories, streaming logs) without blowing up memory or latency.
- Plug‑and‑play upgrade: Existing LSTM/GRU codebases can adopt MC with a few lines of wrapper code; no need to rewrite the whole model or switch to a Transformer stack.
- Cost‑effective scaling: For SaaS platforms that process massive text streams, MC offers a middle ground—better recall than plain RNNs, cheaper than running full‑scale Transformers.
- Potential for hybrid architectures: MC can be combined with recent linear‑attention Transformers, yielding “memory‑augmented” hybrids that further push the limits of context length.
- Research reuse: The open‑source cache modules can serve as a building block for other sequence‑heavy domains such as DNA‑seq analysis, time‑series forecasting, or reinforcement‑learning agents that need long‑term state.
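As a rough picture of what a "few lines of wrapper code" could look like, here is a hypothetical plug-and-play wrapper around an arbitrary recurrent step function. The class name, its interface, and the FIFO eviction policy are all assumptions for illustration; the authors' released modules are learned and differentiable, whereas this sketch uses fixed averaging.

```python
import numpy as np

class MemoryCacheWrapper:
    """Hypothetical MC wrapper: wraps any step function h' = step(h, x),
    checkpoints the hidden state every `interval` steps, and exposes a
    read() that blends the cache into the current state. `max_cache`
    bounds memory via simple FIFO eviction (illustrative only)."""

    def __init__(self, step_fn, interval=8, max_cache=64):
        self.step_fn = step_fn
        self.interval = interval
        self.max_cache = max_cache
        self.cache = []
        self.t = 0

    def step(self, h, x):
        h = self.step_fn(h, x)
        self.t += 1
        if self.t % self.interval == 0:
            self.cache.append(np.asarray(h).copy())
            if len(self.cache) > self.max_cache:
                self.cache.pop(0)      # evict the oldest checkpoint
        return h

    def read(self, h):
        """Merge the averaged cache into the current hidden state."""
        if not self.cache:
            return h
        return h + np.mean(self.cache, axis=0)
```

The appeal of this shape is that the wrapped cell (LSTM, GRU, or anything else with a `step(h, x)` signature) needs no modification; only the call site changes.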
Limitations & Future Work
- Cache management overhead: While the authors keep it low, very long sequences (hundreds of thousands of steps) still require careful tuning of cache size and eviction policy to avoid GPU memory spikes.
- Task‑specific tuning: The optimal sparsity level or gating architecture varies across domains; a one‑size‑fits‑all setting is not yet identified.
- Comparison scope: Experiments focus on language modeling and recall tasks; broader benchmarks (e.g., multimodal video captioning, code generation) remain unexplored.
- Future directions suggested by the authors include:
- Learning dynamic cache‑update schedules,
- Integrating MC with retrieval‑augmented models, and
- Extending the technique to non‑RNN recurrent structures such as Neural ODEs or state‑space models.
Authors
- Ali Behrouz
- Zeman Li
- Yuan Deng
- Peilin Zhong
- Meisam Razaviyayn
- Vahab Mirrokni
Paper Information
- arXiv ID: 2602.24281v1
- Categories: cs.LG, cs.AI
- Published: February 27, 2026