[Paper] Memory Caching: RNNs with Growing Memory
Source: arXiv - 2602.24281v1
Overview
The paper “Memory Caching: RNNs with Growing Memory” proposes a lightweight add‑on that lets recurrent neural networks (RNNs) expand their effective memory as a sequence gets longer. By checkpoint‑caching hidden states, the authors bridge the gap between the linear‑time, fixed‑size memory of classic RNNs and the quadratic‑time, ever‑growing memory of Transformers—offering a tunable trade‑off that can be deployed on today’s hardware.
Key Contributions
- Memory Caching (MC) technique: a simple mechanism to store and reuse past hidden‑state checkpoints, effectively enlarging an RNN’s memory capacity without changing its core recurrence.
- Four MC variants:
- Plain caching – naïve storage of every hidden state.
- Gated aggregation – learns a weighted blend of cached states.
- Sparse selective caching – keeps only a subset of checkpoints based on a learned importance score.
- Hybrid deep‑memory caching – integrates MC with multi‑layer (deep) memory modules.
- Complexity interpolation: MC can be configured to run anywhere from \(O(L)\) (RNN‑like) to \(O(L^2)\) (Transformer‑like) time, letting practitioners pick the sweet spot for latency vs. accuracy.
- Empirical validation: Demonstrates consistent gains on language modeling benchmarks (e.g., WikiText‑103) and long‑context reasoning tasks, narrowing the performance gap to Transformers while staying cheaper.
- Open‑source implementation: The authors release code and pretrained checkpoints, making it easy for developers to plug MC into existing RNN pipelines.
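To make the core idea concrete, here is a minimal toy sketch of the plain-caching variant in numpy. Everything here is an illustrative assumption, not the authors' released code: the step function is a vanilla tanh RNN standing in for an LSTM/GRU, checkpoints are taken at a fixed interval, and the cache is merged by simple averaging and addition rather than a learned read-out.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One vanilla tanh RNN step (a stand-in for an LSTM/GRU cell)."""
    return np.tanh(h @ W_h + x @ W_x)

def run_with_plain_cache(xs, d, interval=4, seed=0):
    """Plain Memory Caching (toy version): checkpoint the hidden state
    every `interval` steps, then average the cache into the final state.
    The paper's variants use learned gating/selection instead."""
    rng = np.random.default_rng(seed)
    W_h = rng.standard_normal((d, d)) * 0.1
    W_x = rng.standard_normal((d, d)) * 0.1
    h = np.zeros(d)
    cache = []
    for t, x in enumerate(xs):
        h = rnn_step(h, x, W_h, W_x)
        if (t + 1) % interval == 0:       # fixed-interval checkpointing
            cache.append(h.copy())
    memory = np.mean(cache, axis=0) if cache else np.zeros(d)
    return h + memory                     # merge memory via addition
```

Note that the recurrence itself is untouched; the cache sits beside it, which is what makes MC an add-on rather than a new architecture.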
Methodology
- Baseline RNN – The authors start with a standard recurrent architecture (e.g., LSTM or GRU) that processes a token sequence \(\{x_t\}_{t=1}^{L}\) and produces hidden states \(h_t\).
- Checkpointing – At configurable intervals (or when a learned “importance” signal spikes), the current hidden state is saved to a cache \(C = \{c_1, \dots, c_K\}\).
- Memory read‑out – When the RNN needs to produce an output at step \(t\), it queries the cache.
- Plain MC simply concatenates or averages all cached states.
- Gated aggregation learns a gate \(g_k = \sigma(W_g c_k + b_g)\) and computes \(\tilde{h}_t = \sum_k g_k c_k\).
- Sparse selective MC applies top‑k selection to a scoring function \(s_k = f(c_k)\), keeping only the most relevant checkpoints.
- Integration – The retrieved memory \(\tilde{h}_t\) is merged with the current hidden state (e.g., via addition or a small feed‑forward network) before the final output layer.
- Training – The whole system remains end‑to‑end differentiable; the cache operations are implemented with efficient tensor indexing, so training overhead stays modest.
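The two learned read-outs above can be sketched in a few lines of numpy. This is a hedged illustration under simplifying assumptions: the gate is a scalar per checkpoint (the paper's \(W_g\) may produce vector gates), the scorer \(f\) is a plain linear map, and the selected checkpoints are averaged rather than combined by a learned module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_readout(cache, W_g, b_g):
    """Gated aggregation: g_k = sigmoid(W_g c_k + b_g), one scalar gate
    per checkpoint here, then memory = sum_k g_k * c_k."""
    gates = sigmoid(cache @ W_g + b_g)        # shape (K,)
    return (gates[:, None] * cache).sum(axis=0)

def sparse_readout(cache, score_w, k=2):
    """Sparse selective MC: score checkpoints with s_k = f(c_k)
    (a linear scorer here) and keep only the top-k before averaging."""
    scores = cache @ score_w                  # shape (K,)
    top = np.argsort(scores)[-k:]             # indices of top-k checkpoints
    return cache[top].mean(axis=0)
```

Both functions are differentiable almost everywhere (top-k selection aside), consistent with the paper's claim that the system trains end-to-end; in practice the scorer and gate would be trained jointly with the RNN.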
Results & Findings
| Task | Model | Perplexity / Accuracy | Relative Cost |
|---|---|---|---|
| WikiText‑103 (LM) | LSTM (baseline) | 34.2 | 1× |
| WikiText‑103 (LM) | LSTM + Plain MC (full cache) | 30.8 | 1.3× |
| WikiText‑103 (LM) | LSTM + Gated MC | 30.5 | 1.4× |
| WikiText‑103 (LM) | LSTM + Sparse MC (top‑10%) | 31.2 | 1.2× |
| Long‑Context QA | Deep RNN | 68.4% F1 | 1× |
| Long‑Context QA | Deep RNN + Hybrid MC | 71.9% F1 | 1.5× |
| In‑Context Recall | Transformer (baseline) | 92.1% | 1× |
| In‑Context Recall | RNN + Gated MC | 89.4% | 0.6× |
- Performance boost: All MC variants improve perplexity and downstream task scores, with gated aggregation giving the strongest lift.
- Efficiency: Even the full‑cache version stays well below the quadratic cost of a Transformer, and the sparse version can be tuned to run almost as fast as a vanilla RNN.
- Memory‑accuracy trade‑off: By adjusting cache size or sparsity, developers can dial in the desired balance—e.g., a 10% cache yields ~90% of the full‑cache gain at <20% extra compute.
Practical Implications
- Deployable on edge / low‑power devices: MC lets you keep the lightweight recurrence of RNNs while handling longer contexts (e.g., chat histories, streaming logs) without blowing up memory or latency.
- Plug‑and‑play upgrade: Existing LSTM/GRU codebases can adopt MC with a few lines of wrapper code; no need to rewrite the whole model or switch to a Transformer stack.
- Cost‑effective scaling: For SaaS platforms that process massive text streams, MC offers a middle ground—better recall than plain RNNs, cheaper than running full‑scale Transformers.
- Potential for hybrid architectures: MC can be combined with recent linear‑attention Transformers, yielding “memory‑augmented” hybrids that further push the limits of context length.
- Research reuse: The open‑source cache modules can serve as a building block for other sequence‑heavy domains such as DNA‑seq analysis, time‑series forecasting, or reinforcement‑learning agents that need long‑term state.
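As a rough picture of what a "few lines of wrapper code" could look like, here is a hypothetical plug-and-play wrapper around an arbitrary recurrent step function. The class name, its interface, and the FIFO eviction policy are all assumptions for illustration; the authors' released modules are learned and differentiable, whereas this sketch uses fixed averaging.

```python
import numpy as np

class MemoryCacheWrapper:
    """Hypothetical MC wrapper: wraps any step function h' = step(h, x),
    checkpoints the hidden state every `interval` steps, and exposes a
    read() that blends the cache into the current state. `max_cache`
    bounds memory via simple FIFO eviction (illustrative only)."""

    def __init__(self, step_fn, interval=8, max_cache=64):
        self.step_fn = step_fn
        self.interval = interval
        self.max_cache = max_cache
        self.cache = []
        self.t = 0

    def step(self, h, x):
        h = self.step_fn(h, x)
        self.t += 1
        if self.t % self.interval == 0:
            self.cache.append(np.asarray(h).copy())
            if len(self.cache) > self.max_cache:
                self.cache.pop(0)      # evict the oldest checkpoint
        return h

    def read(self, h):
        """Merge the averaged cache into the current hidden state."""
        if not self.cache:
            return h
        return h + np.mean(self.cache, axis=0)
```

The appeal of this shape is that the wrapped cell (LSTM, GRU, or anything else with a `step(h, x)` signature) needs no modification; only the call site changes.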
Limitations & Future Work
- Cache management overhead: While the authors keep it low, very long sequences (hundreds of thousands of steps) still require careful tuning of cache size and eviction policy to avoid GPU memory spikes.
- Task‑specific tuning: The optimal sparsity level or gating architecture varies across domains; a one‑size‑fits‑all setting is not yet identified.
- Comparison scope: Experiments focus on language modeling and recall tasks; broader benchmarks (e.g., multimodal video captioning, code generation) remain unexplored.
- Future directions suggested by the authors include:
- Learning dynamic cache‑update schedules,
- Integrating MC with retrieval‑augmented models, and
- Extending the technique to non‑RNN recurrent structures such as Neural ODEs or state‑space models.
Authors
- Ali Behrouz
- Zeman Li
- Yuan Deng
- Peilin Zhong
- Meisam Razaviyayn
- Vahab Mirrokni
Paper Information
- arXiv ID: 2602.24281v1
- Categories: cs.LG, cs.AI
- Published: February 27, 2026