[Paper] Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching

Published: February 26, 2026 at 04:53 AM EST
5 min read
Source: arXiv

Overview

Local large‑language‑model (LLM) inference on tiny edge devices—think Raspberry Pi Zero 2W or similar micro‑controllers—has long been hamstrung by limited CPU, memory, and energy budgets. The paper Accelerating Local LLMs on Resource‑Constrained Edge Devices via Distributed Prompt Caching proposes a clever workaround: let a cluster of low‑end devices share the intermediate “prompt‑processing” states they compute, so each device can reuse work that another device has already done. By doing this, the authors achieve dramatic speed‑ups in both the time to first token (TTFT) and the time to last token (TTLT, i.e., total generation time).

Key Contributions

  • Distributed Prompt Cache: A system that spreads cached intermediate states (the hidden‑layer activations generated while processing a prompt) across multiple edge nodes, enabling any node to pull already‑computed states instead of recomputing them.
  • Partial‑Match Support: The cache can serve similar prompts, not just exact duplicates, by matching overlapping sub‑prompts and reusing the corresponding activations.
  • Bloom‑Filter Catalog: A lightweight, probabilistic index that tells a node whether a remote peer likely holds the needed state, cutting down unnecessary wireless traffic.
  • Real‑World Evaluation: Demonstrated on Gemma‑3 270 M (≈270 M parameters) with the MMLU benchmark, running on a fleet of Raspberry Pi Zero 2W devices, showing up to 93 % reduction in TTFT and 50 % reduction in TTLT on average.
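The catalog idea is easy to picture with a from‑scratch Bloom filter. The sizes and hash scheme below are illustrative assumptions, not the paper's actual parameters; the key property is that a lookup can yield false positives but never false negatives, so a "no" answer safely skips a network round trip:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes over an m-bit array.
    Membership tests may return false positives, never false negatives."""

    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)  # 1 KB at the default size

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-256 with the probe index.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A peer can broadcast just the `bits` array (1 KB here) as its catalog; requesters test chunk hashes against it locally before touching the radio.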

Methodology

  1. Prompt Decomposition: When a user submits a prompt, the system splits it into overlapping chunks (e.g., n‑gram windows). Each chunk corresponds to a set of internal activations that can be cached.
  2. Local Cache Lookup: The device first checks its own memory for a matching chunk. If found, it reuses the stored activations, skipping the expensive forward pass for that segment.
  3. Distributed Lookup via Catalog: If the local cache misses, the device queries a catalog—a Bloom filter maintained on each peer that encodes the set of chunk hashes the peer currently stores. Because Bloom filters are tiny (a few kilobytes) and can be broadcast over Wi‑Fi/BLE with negligible cost, they quickly tell the requester whether a remote node might have the needed chunk.
  4. State Retrieval: When the catalog indicates a possible hit, the requester asks the corresponding peer for the exact activation tensor. The peer streams the tensor over the wireless link; the requester then stitches it into its own inference pipeline.
  5. Partial Matching: Even if the exact chunk isn’t present, the system can fall back to the longest matching prefix/suffix, recomputing only the missing tail. This leverages the fact that many prompts share common phrasing (e.g., “Explain the difference between …”).
  6. Cache Eviction & Consistency: Each node runs a simple LRU (least‑recently‑used) policy to keep the most useful chunks, and the Bloom filters are refreshed periodically to reflect evictions.
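The decomposition, longest‑prefix fallback, and LRU eviction (steps 1, 5, and 6) can be sketched as follows. The window/stride sizes and the representation of activations are placeholder assumptions; the paper's actual chunking granularity is not specified here:

```python
from collections import OrderedDict

def chunk_prompt(tokens, window=8, stride=4):
    """Step 1: split a token sequence into overlapping chunks."""
    return [tuple(tokens[i:i + window])
            for i in range(0, max(len(tokens) - window + 1, 1), stride)]

class PromptCache:
    """Per-device cache of chunk -> activations with LRU eviction (step 6)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion order doubles as LRU order

    def get(self, chunk):
        if chunk in self.store:
            self.store.move_to_end(chunk)  # mark as recently used
            return self.store[chunk]
        return None

    def put(self, chunk, activations):
        self.store[chunk] = activations
        self.store.move_to_end(chunk)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least-recently-used entry

    def longest_prefix_match(self, chunk):
        """Step 5: fall back to the longest cached prefix of this chunk,
        so only the missing tail needs recomputing."""
        for end in range(len(chunk), 0, -1):
            hit = self.get(chunk[:end])
            if hit is not None:
                return chunk[:end], hit
        return None, None
```

On a real device the stored values would be quantized activation tensors rather than Python objects, but the lookup logic is the same.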

Results & Findings

| Metric | Baseline (single Pi) | Distributed Cache | Improvement |
| --- | --- | --- | --- |
| TTFT (ms) | ~1,200 | ~84 | ~93 % reduction |
| TTLT (ms) | ~4,800 | ~2,400 | ~50 % reduction |
| Network overhead | — | ~0.3 MB per inference (mostly catalog traffic) | Negligible vs. compute savings |
| Cache hit rate (exact) | 0 % | 38 % | — |
| Cache hit rate (partial) | 0 % | 71 % | — |

The experiments used the MMLU (Massive Multitask Language Understanding) suite, which contains a wide variety of question‑answer tasks. Even with the modest 270 M‑parameter Gemma‑3 model, the distributed cache turned a sluggish edge inference into a near‑real‑time experience.

Practical Implications

  • Edge‑AI Deployments: Companies can now run LLM‑powered features (e.g., on‑device assistants, anomaly detection, or code completion) on ultra‑low‑power hardware without offloading to the cloud, preserving privacy and reducing latency.
  • Cost Savings: By reusing computation across devices, the overall energy consumption per inference drops dramatically—critical for battery‑operated IoT fleets.
  • Scalable Collaboration: The Bloom‑filter catalog makes the system tolerant of intermittent connectivity; devices can still benefit from peers even when the network is lossy.
  • Framework Integration: The approach is orthogonal to model architecture, meaning it could be wrapped around existing inference runtimes (e.g., TensorFlow Lite, ONNX Runtime) with minimal code changes.
  • Developer Tooling: A simple API (e.g., cache.get_or_compute(prompt_chunk)) abstracts away the networking, letting developers focus on application logic while the runtime handles distributed caching under the hood.
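A minimal sketch of what such a `get_or_compute` API could look like. The signature, the `peers` structure, and the hash-based keying below are assumptions for illustration, not the runtime's actual interface; the point is that local lookup, catalog-gated peer fetches, and recomputation all hide behind one call:

```python
def get_or_compute(prompt_chunk, local_cache, peers, compute_fn):
    """Hypothetical convenience wrapper around the distributed cache.
    `peers` is a list of (catalog, fetch) pairs: `catalog.might_contain`
    is a Bloom-filter membership test, and `fetch` may still return None
    on a catalog false positive."""
    key = hash(prompt_chunk)  # a content hash in a real system
    if key in local_cache:                   # 1. local cache hit
        return local_cache[key]
    for catalog, fetch in peers:             # 2. distributed lookup
        if catalog.might_contain(key):
            tensor = fetch(key)              # 3. retrieve from the peer
            if tensor is not None:
                local_cache[key] = tensor
                return tensor
    result = compute_fn(prompt_chunk)        # 4. full forward pass as fallback
    local_cache[key] = result
    return result
```

From the application's point of view, repeated calls with a popular chunk hit step 1 and never touch the network or the model.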

Limitations & Future Work

  • Memory Footprint: Even with LRU eviction, storing activation tensors can quickly exhaust the few megabytes of RAM on devices like the Pi Zero 2W; smarter compression or quantization of cached states is needed.
  • Network Dependence: The speed gains assume a reasonably reliable Wi‑Fi/BLE link; in highly congested or high‑latency environments the communication cost could offset compute savings.
  • Security & Privacy: Sharing internal activations across devices raises questions about data leakage; future work should explore encryption or homomorphic techniques for the cached tensors.
  • Generalization to Larger Models: The study focused on a 270 M‑parameter model; scaling the method to 1‑2 B‑parameter models may require hierarchical caching or edge‑cloud hybrid designs.

Bottom line: Distributed prompt caching offers a pragmatic path to bring LLM capabilities to the tiniest edge devices, turning a once‑impractical compute problem into a collaborative, network‑enabled solution. As edge AI continues to grow, techniques that let devices “share work” rather than “reinvent the wheel” will become a cornerstone of low‑power, privacy‑preserving AI deployments.

Authors

  • Hiroki Matsutani
  • Naoki Matsuda
  • Naoto Sugiura

Paper Information

  • arXiv ID: 2602.22812v1
  • Categories: cs.LG, cs.DC
  • Published: February 26, 2026
  • PDF: Download PDF