[Paper] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Published: December 18, 2025 at 01:18 PM EST
3 min read
Source: arXiv - 2512.16843v1

Overview

The paper introduces LLMCache, a layer‑wise caching system that speeds up transformer inference by reusing intermediate activations when new inputs are semantically similar to previously seen ones. Because it can operate at any transformer layer and is model‑agnostic, LLMCache delivers noticeable latency reductions (up to 3.1× in the reported benchmarks) without sacrificing accuracy, making it attractive for real‑time and large‑scale deployments.

Key Contributions

  • Layer‑wise caching framework that works for both encoder‑only (e.g., BERT) and decoder‑only (e.g., GPT‑2) architectures.
  • Semantic fingerprinting: a lightweight method to detect when a new input is “close enough” to a cached one, enabling reuse of hidden states (a minimal sketch follows this list).
  • Adaptive eviction policies that balance cache freshness against memory pressure, preventing stale activations from hurting model quality.
  • Model‑agnostic design: no changes to the underlying transformer weights or training pipeline are required.
  • Empirical validation across three benchmarks (SQuAD, WikiText‑103, OpenBookQA) showing up to 3.1× speed‑up with at most 0.5 % accuracy loss.
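
To make the fingerprinting idea concrete, here is a minimal sketch, assuming a fixed random projection of mean‑pooled first‑layer embeddings and a sign‑based 64‑bit signature; the projection width, pooling, and hashing choices are illustrative assumptions, not the paper's exact method.

```python
import torch

# Minimal sketch of a semantic fingerprint (illustrative assumptions, not the
# paper's exact method): project mean-pooled first-layer token embeddings with
# a fixed random matrix and keep only the signs, yielding a 64-bit signature.
HIDDEN_DIM = 768   # assumed hidden size (e.g., BERT-Base)
FP_BITS = 64       # fingerprint width; the paper's ablation favours 64 bits

torch.manual_seed(0)
_projection = torch.randn(HIDDEN_DIM, FP_BITS)

def semantic_fingerprint(first_layer_embeddings: torch.Tensor) -> int:
    """Map (seq_len, hidden_dim) embeddings to a compact integer signature."""
    pooled = first_layer_embeddings.mean(dim=0)          # (hidden_dim,)
    bits = (pooled @ _projection) > 0                    # sign of each projection
    return int("".join("1" if b else "0" for b in bits.tolist()), 2)
```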

Methodology

  1. Fingerprint Generation – For each incoming sequence, LLMCache computes a compact “semantic fingerprint” (e.g., a short hash derived from a low‑dimensional projection of the first‑layer token embeddings).
  2. Similarity Lookup – The fingerprint is compared against entries already stored in the cache. If a match exceeds a configurable similarity threshold, the system treats the new input as a near‑duplicate.
  3. Activation Reuse – Instead of recomputing every layer, LLMCache retrieves the cached hidden states from the deepest matching layer and resumes forward propagation from that point onward.
  4. Cache Management – An adaptive eviction strategy monitors usage frequency, recency, and a freshness score (based on how far the cached activations are from the current model parameters) to decide which entries to drop.
  5. Integration – The caching logic is wrapped around the standard transformer forward pass, requiring only a thin plug‑in layer; no retraining or model‑specific modifications are needed. A minimal sketch of the lookup‑and‑resume flow follows this list.
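
The sketch below illustrates steps 2–5 under stated assumptions: 64‑bit fingerprints compared by Hamming distance, a single cached layer per entry, and a simple frequency/recency eviction score. Names such as `LayerCache` and `forward_with_cache`, the similarity threshold, and the score weights are illustrative assumptions rather than the paper's implementation.

```python
import time
import torch

class LayerCache:
    """Sketch of a layer-wise activation cache (illustrative, not the paper's code)."""

    def __init__(self, max_entries: int = 10_000, max_hamming: int = 4):
        self.max_entries = max_entries
        self.max_hamming = max_hamming   # similarity threshold on 64-bit fingerprints
        self.entries = {}                # fingerprint -> {"hidden", "layer", "hits", "last_used"}

    def lookup(self, fp: int):
        """Return the closest cached entry within the Hamming threshold, else None."""
        best, best_dist = None, self.max_hamming + 1
        for cached_fp, entry in self.entries.items():
            dist = bin(fp ^ cached_fp).count("1")        # Hamming distance between signatures
            if dist < best_dist:
                best, best_dist = entry, dist
        return best

    def store(self, fp: int, hidden: torch.Tensor, layer: int) -> None:
        if len(self.entries) >= self.max_entries:
            self._evict_one()
        self.entries[fp] = {"hidden": hidden.detach(), "layer": layer,
                            "hits": 0, "last_used": time.time()}

    def _evict_one(self) -> None:
        # Simplified adaptive eviction: drop the entry with the lowest combined
        # frequency/recency score (the paper's freshness term is omitted here).
        now = time.time()
        victim = min(self.entries, key=lambda k: self.entries[k]["hits"]
                     - 0.1 * (now - self.entries[k]["last_used"]))
        del self.entries[victim]


def forward_with_cache(layers, embeddings, cache, fingerprint_fn, cache_layer: int = 6):
    """Run a layer stack, resuming from cached hidden states on a fingerprint hit."""
    fp = fingerprint_fn(embeddings[0])           # fingerprint of the (first) sequence
    hit = cache.lookup(fp)
    if hit is not None:                          # near-duplicate: skip already-computed layers
        hidden, start = hit["hidden"], hit["layer"]
        hit["hits"] += 1
        hit["last_used"] = time.time()
    else:
        hidden, start = embeddings, 0
    for i in range(start, len(layers)):
        hidden = layers[i](hidden)
        if hit is None and i + 1 == cache_layer: # cache activations at the chosen layer
            cache.store(fp, hidden, i + 1)
    return hidden
```

In a real deployment the cached hidden states would also have to match batch shapes and attention masks; the sketch ignores those details to keep the lookup‑and‑resume control flow visible.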

Results & Findings

| Model / Task | Baseline Latency (ms) | LLMCache Latency (ms) | Speed‑up | Accuracy Δ |
|---|---|---|---|---|
| BERT‑Base (SQuAD) | 48 | 16 | 3.0× | −0.3 % |
| GPT‑2 (WikiText‑103) | 62 | 20 | 3.1× | −0.4 % |
| BERT‑Large (OpenBookQA) | 71 | 28 | 2.5× | −0.5 % |

  • Cache hit rates ranged from 38 % to 62 % depending on the dataset’s redundancy, confirming that many real‑world inputs share enough semantic overlap to benefit from reuse.
  • Memory overhead stayed under 1 GB for a cache size of 10 k entries on a single GPU, well within typical production budgets (a rough back‑of‑envelope estimate follows this list).
  • Ablation studies showed that fingerprint dimensionality of 64 bits offered the best trade‑off between hit‑rate and collision risk.
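
As a sanity check on the reported footprint, a rough back‑of‑envelope estimate lands just under 1 GB for 10 k entries. The assumptions below (one cached layer per entry, BERT‑Base hidden size, sequence length 64, fp16 storage) are ours, not the paper's configuration.

```python
# Back-of-envelope cache size (illustrative assumptions, not the paper's setup):
# one cached layer per entry, BERT-Base hidden size, sequence length 64, fp16.
hidden_dim = 768
seq_len = 64
bytes_per_value = 2            # fp16
entries = 10_000

per_entry = seq_len * hidden_dim * bytes_per_value   # 98,304 bytes ≈ 96 KiB
total_gb = entries * per_entry / 1024 ** 3
print(f"{per_entry / 1024:.0f} KiB per entry, {total_gb:.2f} GiB total")   # 96 KiB, 0.92 GiB
```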

Practical Implications

  • Real‑time services (chatbots, code assistants, search) can shave tens of milliseconds per request, translating into higher throughput and lower cloud costs.
  • Edge deployments (mobile or IoT devices) gain a viable path to run larger LLMs locally because the cache reduces the number of expensive matrix multiplications.
  • Batch processing pipelines (e.g., document summarization) can reuse activations across similar documents, cutting total inference time dramatically without altering the model.
  • Framework integration – The authors released a PyTorch‑compatible library that can be dropped into existing inference servers (e.g., TorchServe, FastAPI) with a single decorator, lowering the barrier for adoption (a hypothetical sketch of the pattern follows this list).
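
This summary does not show the released library's API, so the decorator below is only a hypothetical illustration of how a single‑decorator integration could look; `llm_cache`, `resume_from`, and the return contract are invented for the sketch.

```python
import functools

def llm_cache(cache, fingerprint_fn):
    """Hypothetical decorator (not the released library's real API): short-circuit
    an inference function through the cache on a fingerprint hit."""
    def decorator(infer_fn):
        @functools.wraps(infer_fn)
        def wrapper(inputs, *args, **kwargs):
            fp = fingerprint_fn(inputs)      # fingerprint of the raw inputs/embeddings
            hit = cache.lookup(fp)
            if hit is not None:
                # Resume from cached activations instead of recomputing all layers.
                return infer_fn(inputs, *args, resume_from=hit, **kwargs)
            result = infer_fn(inputs, *args, **kwargs)
            # Assumed contract: infer_fn returns the hidden states and the layer
            # at which they should be cached.
            cache.store(fp, result["hidden"], result["layer"])
            return result
        return wrapper
    return decorator

# Usage sketch:
# @llm_cache(cache=LayerCache(), fingerprint_fn=semantic_fingerprint)
# def serve_request(inputs, resume_from=None):
#     ...
```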

Limitations & Future Work

  • Cache effectiveness hinges on input redundancy; highly diverse streams (e.g., random queries) yield low hit rates, limiting speed‑ups.
  • The current fingerprinting scheme is static; dynamic, learned similarity metrics could capture richer semantics.
  • Cache consistency under model updates (e.g., fine‑tuning) is not fully explored—future work could investigate automatic invalidation or versioned caches.
  • Scaling to multi‑GPU or distributed settings introduces synchronization overhead; the paper leaves distributed cache coherence as an open challenge.

Overall, LLMCache offers a pragmatic, model‑agnostic tool for developers looking to squeeze more performance out of transformer inference without sacrificing accuracy—a compelling addition to the performance‑engineering toolbox.

Authors

  • Harsh Vardhan Bansal

Paper Information

  • arXiv ID: 2512.16843v1
  • Categories: cs.CL, cs.AI
  • Published: December 18, 2025
  • PDF: Download PDF
