[Paper] Memory Caching: RNNs with Growing Memory

Published: February 27, 2026 at 01:53 PM EST
5 min read
Source: arXiv - 2602.24281v1

Overview

The paper “Memory Caching: RNNs with Growing Memory” proposes a lightweight add‑on that lets recurrent neural networks (RNNs) expand their effective memory as a sequence gets longer. By checkpoint‑caching hidden states, the authors bridge the gap between the linear‑time, fixed‑size memory of classic RNNs and the quadratic‑time, ever‑growing memory of Transformers—offering a tunable trade‑off that can be deployed on today’s hardware.

Key Contributions

  • Memory Caching (MC) technique: a simple mechanism to store and reuse past hidden‑state checkpoints, effectively enlarging an RNN’s memory capacity without changing its core recurrence.
  • Four MC variants:
    1. Plain caching – naïve storage of every hidden state.
    2. Gated aggregation – learns a weighted blend of cached states.
    3. Sparse selective caching – keeps only a subset of checkpoints based on a learned importance score.
    4. Hybrid deep‑memory caching – integrates MC with multi‑layer (deep) memory modules.
  • Complexity interpolation: MC can be configured to run anywhere from $O(L)$ (RNN‑like) to $O(L^2)$ (Transformer‑like) time, letting practitioners pick the sweet spot for latency vs. accuracy.
  • Empirical validation: Demonstrates consistent gains on language modeling benchmarks (e.g., WikiText‑103) and long‑context reasoning tasks, narrowing the performance gap to Transformers while staying cheaper.
  • Open‑source implementation: The authors release code and pretrained checkpoints, making it easy for developers to plug MC into existing RNN pipelines.
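The complexity interpolation above can be made concrete with a back‑of‑the‑envelope cost model (a sketch, not the paper's analysis): assume each step does constant recurrence work plus one read over every checkpoint cached so far, with one checkpoint stored every `interval` steps.

```python
# Hypothetical cost model for memory caching at different checkpoint
# intervals, illustrating the O(L) -> O(L^2) interpolation: each step
# costs 1 unit of recurrence work plus one read per cached checkpoint.

def mc_cost(L, interval):
    """Approximate total work for a sequence of length L."""
    total = 0
    cached = 0
    for t in range(1, L + 1):
        total += 1 + cached        # recurrence + read over current cache
        if t % interval == 0:
            cached += 1            # store a new checkpoint
    return total

L = 1024
print(mc_cost(L, L))   # interval = L: almost no checkpoints, RNN-like O(L)
print(mc_cost(L, 1))   # interval = 1: checkpoint every step, Transformer-like O(L^2)
```

Intermediate intervals land between the two extremes, which is exactly the tunable trade‑off the contribution list describes.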

Methodology

  1. Baseline RNN – The authors start with a standard recurrent architecture (e.g., LSTM or GRU) that processes a token sequence $\{x_t\}_{t=1}^{L}$ and produces hidden states $h_t$.
  2. Checkpointing – At configurable intervals (or when a learned “importance” signal spikes), the current hidden state is saved to a cache $C = \{c_1, \dots, c_K\}$.
  3. Memory read‑out – When the RNN needs to produce an output at step $t$, it queries the cache.
    • Plain MC simply concatenates or averages all cached states.
    • Gated aggregation learns a gate $g_k = \sigma(W_g c_k + b_g)$ and computes $\tilde{h}_t = \sum_k g_k c_k$.
    • Sparse selective MC applies a top‑k selection on a scoring function $s_k = f(c_k)$ to keep only the most relevant checkpoints.
  4. Integration – The retrieved memory $\tilde{h}_t$ is merged with the current hidden state (e.g., via addition or a small feed‑forward network) before the final output layer.
  5. Training – The whole system remains end‑to‑end differentiable; the cache operations are implemented with efficient tensor indexing, so training overhead stays modest.
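The pipeline above can be sketched end to end for the gated‑aggregation variant. This is a minimal NumPy illustration, not the paper's implementation: the single‑layer tanh recurrence, weight scales, and fixed checkpoint interval are stand‑ins, while the gate $g_k = \sigma(W_g c_k + b_g)$ and the additive integration follow the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_g = rng.normal(scale=0.1, size=(1, d))   # gate parameters (illustrative init)
b_g = np.zeros(1)

def gated_readout(cache):
    """tilde_h = sum_k g_k * c_k over cached checkpoints c_k."""
    C = np.stack(cache)                    # (K, d)
    g = sigmoid(C @ W_g.T + b_g)           # (K, 1): one scalar gate per checkpoint
    return (g * C).sum(axis=0)             # (d,)

# Toy run: stand-in tanh recurrence, checkpoint every 4th hidden state,
# merge the retrieved memory by addition before the (omitted) output layer.
W_h = rng.normal(scale=0.1, size=(d, d))
W_x = rng.normal(scale=0.1, size=(d, d))
h = np.zeros(d)
cache = []
for t, x in enumerate(rng.normal(size=(16, d)), start=1):
    h = np.tanh(W_h @ h + W_x @ x)         # step 1: baseline recurrence
    if t % 4 == 0:
        cache.append(h.copy())             # step 2: checkpointing
    if cache:
        h = h + gated_readout(cache)       # steps 3-4: read-out + integration

print(h.shape)  # (8,)
```

Because every operation here is differentiable tensor arithmetic, gradients flow through the cache reads exactly as the training step describes.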

Results & Findings

| Task | Model | Perplexity / Accuracy | Relative Cost |
| --- | --- | --- | --- |
| WikiText‑103 (LM) | LSTM (baseline) | 34.2 | 1.0× |
| | LSTM + Plain MC (full cache) | 30.8 | 1.3× |
| | LSTM + Gated MC | 30.5 | 1.4× |
| | LSTM + Sparse MC (top‑10%) | 31.2 | 1.2× |
| Long‑Context QA | Deep RNN | 68.4% F1 | 1.0× |
| | Deep RNN + Hybrid MC | 71.9% F1 | 1.5× |
| In‑Context Recall | Transformer (baseline) | 92.1% | 1.0× |
| | RNN + Gated MC | 89.4% | 0.6× |
  • Performance boost: All MC variants improve perplexity and downstream task scores, with gated aggregation giving the strongest lift.
  • Efficiency: Even the full‑cache version stays well below the quadratic cost of a Transformer, and the sparse version can be tuned to run almost as fast as a vanilla RNN.
  • Memory‑accuracy trade‑off: By adjusting cache size or sparsity, developers can dial in the desired balance—e.g., a 10 % cache yields ~90 % of the full‑cache gain at < 20 % extra compute.
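The sparsity knob behind that trade‑off can be sketched in a few lines. This is an illustration, not the paper's code: the linear scorer standing in for $s_k = f(c_k)$ and the `keep_frac` parameter are hypothetical (the paper learns the scores end to end).

```python
import numpy as np

# Sparse selective caching sketch: keep only the top-k checkpoints
# under a scoring function s_k = f(c_k). A fixed linear scorer stands
# in for the learned importance score.

rng = np.random.default_rng(1)
d, K = 8, 20
cache = rng.normal(size=(K, d))        # K cached hidden states
w_score = rng.normal(size=d)           # hypothetical scorer parameters

def sparse_select(cache, w, keep_frac=0.10):
    k = max(1, int(len(cache) * keep_frac))
    scores = cache @ w                 # s_k = f(c_k), here linear
    top = np.argsort(scores)[-k:]      # indices of the k highest-scoring checkpoints
    return cache[np.sort(top)]         # keep them in original sequence order

kept = sparse_select(cache, w_score)
print(kept.shape)  # (2, 8): a 10% budget over 20 checkpoints keeps 2
```

Raising `keep_frac` moves the operating point toward the full‑cache column of the table; lowering it moves toward vanilla‑RNN cost.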

Practical Implications

  • Deployable on edge / low‑power devices: MC lets you keep the lightweight recurrence of RNNs while handling longer contexts (e.g., chat histories, streaming logs) without blowing up memory or latency.
  • Plug‑and‑play upgrade: Existing LSTM/GRU codebases can adopt MC with a few lines of wrapper code; no need to rewrite the whole model or switch to a Transformer stack.
  • Cost‑effective scaling: For SaaS platforms that process massive text streams, MC offers a middle ground—better recall than plain RNNs, cheaper than running full‑scale Transformers.
  • Potential for hybrid architectures: MC can be combined with recent linear‑attention Transformers, yielding “memory‑augmented” hybrids that further push the limits of context length.
  • Research reuse: The open‑source cache modules can serve as a building block for other sequence‑heavy domains such as DNA‑seq analysis, time‑series forecasting, or reinforcement‑learning agents that need long‑term state.
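The "plug‑and‑play" claim can be illustrated with a small wrapper. Everything here is hypothetical (the class name `MemoryCached`, the `interval` knob, the mean read‑out of the plain variant); the point is that any step function `h = cell(h, x)` gains a growing cache without touching its internals.

```python
import numpy as np

class MemoryCached:
    """Hypothetical wrapper adding plain memory caching to any RNN step function."""

    def __init__(self, cell, interval=4):
        self.cell = cell            # any callable: (h, x) -> new h
        self.interval = interval    # checkpoint every `interval` steps
        self.cache = []

    def step(self, h, x, t):
        h = self.cell(h, x)
        if t % self.interval == 0:
            self.cache.append(h.copy())        # checkpoint
        if self.cache:                         # plain-MC read-out: mean of cache
            h = h + np.mean(self.cache, axis=0)
        return h

# Usage with a toy tanh cell standing in for an existing LSTM/GRU step:
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(4, 4))
cell = lambda h, x: np.tanh(W @ h + x)
mc = MemoryCached(cell)
h = np.zeros(4)
for t, x in enumerate(rng.normal(scale=0.1, size=(12, 4)), start=1):
    h = mc.step(h, x, t)
print(len(mc.cache))  # 3 checkpoints after 12 steps at interval 4
```

An existing training loop only swaps `cell(h, x)` for `mc.step(h, x, t)`, which is the kind of few‑line upgrade the bullet describes.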

Limitations & Future Work

  • Cache management overhead: While the authors keep it low, very long sequences (hundreds of thousands of steps) still require careful tuning of cache size and eviction policy to avoid GPU memory spikes.
  • Task‑specific tuning: The optimal sparsity level or gating architecture varies across domains; a one‑size‑fits‑all setting is not yet identified.
  • Comparison scope: Experiments focus on language modeling and recall tasks; broader benchmarks (e.g., multimodal video captioning, code generation) remain unexplored.
  • Future directions suggested by the authors include:
    1. Learning dynamic cache‑update schedules,
    2. Integrating MC with retrieval‑augmented models, and
    3. Extending the technique to non‑RNN recurrent structures such as Neural ODEs or state‑space models.

Authors

  • Ali Behrouz
  • Zeman Li
  • Yuan Deng
  • Peilin Zhong
  • Meisam Razaviyayn
  • Vahab Mirrokni

Paper Information

  • arXiv ID: 2602.24281v1
  • Categories: cs.LG, cs.AI
  • Published: February 27, 2026