[Paper] Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching

Published: February 26, 2026 at 04:53 AM EST
5 min read
Source: arXiv

Overview

Local large‑language‑model (LLM) inference on tiny edge devices—think Raspberry Pi Zero 2W or similar micro‑controllers—has long been hamstrung by limited CPU, memory, and energy budgets. The paper Accelerating Local LLMs on Resource‑Constrained Edge Devices via Distributed Prompt Caching proposes a clever workaround: let a cluster of low‑end devices share the intermediate “prompt‑processing” states they compute, so each device can reuse work that another device has already done. By doing this, the authors achieve dramatic speed‑ups in both the time to first token (TTFT) and the time to last token (TTLT, i.e., total generation time).

Key Contributions

  • Distributed Prompt Cache: A system that spreads cached intermediate states (the hidden‑layer activations generated while processing a prompt) across multiple edge nodes, enabling any node to pull already‑computed states instead of recomputing them.
  • Partial‑Match Support: The cache can serve similar prompts, not just exact duplicates, by matching overlapping sub‑prompts and reusing the corresponding activations.
  • Bloom‑Filter Catalog: A lightweight, probabilistic index that tells a node whether a remote peer likely holds the needed state, cutting down unnecessary wireless traffic.
  • Real‑World Evaluation: Demonstrated on Gemma‑3 270 M (≈270 M parameters) with the MMLU benchmark, running on a fleet of Raspberry Pi Zero 2W devices, showing up to 93 % reduction in TTFT and 50 % reduction in TTLT on average.
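The catalog idea is easy to picture with a from‑scratch Bloom filter. The sizes and hash scheme below are illustrative assumptions, not the paper's actual parameters; the key property is that a lookup can yield false positives but never false negatives, so a "no" answer safely skips a network round trip:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes over an m-bit array.
    Membership tests may return false positives, never false negatives."""

    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)  # 1 KB at the default size

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-256 with the probe index.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A peer can broadcast just the `bits` array (1 KB here) as its catalog; requesters test chunk hashes against it locally before touching the radio.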

Methodology

  1. Prompt Decomposition: When a user submits a prompt, the system splits it into overlapping chunks (e.g., n‑gram windows). Each chunk corresponds to a set of internal activations that can be cached.
  2. Local Cache Lookup: The device first checks its own memory for a matching chunk. If found, it reuses the stored activations, skipping the expensive forward pass for that segment.
  3. Distributed Lookup via Catalog: If the local cache misses, the device queries a catalog—a Bloom filter maintained on each peer that encodes the set of chunk hashes the peer currently stores. Because Bloom filters are tiny (a few kilobytes) and can be broadcast over Wi‑Fi/BLE with negligible cost, they quickly tell the requester whether a remote node might have the needed chunk.
  4. State Retrieval: When the catalog indicates a possible hit, the requester asks the corresponding peer for the exact activation tensor. The peer streams the tensor over the wireless link; the requester then stitches it into its own inference pipeline.
  5. Partial Matching: Even if the exact chunk isn’t present, the system can fall back to the longest matching prefix/suffix, recomputing only the missing tail. This leverages the fact that many prompts share common phrasing (e.g., “Explain the difference between …”).
  6. Cache Eviction & Consistency: Each node runs a simple LRU (least‑recently‑used) policy to keep the most useful chunks, and the Bloom filters are refreshed periodically to reflect evictions.
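The decomposition, longest‑prefix fallback, and LRU eviction (steps 1, 5, and 6) can be sketched as follows. The window/stride sizes and the representation of activations are placeholder assumptions; the paper's actual chunking granularity is not specified here:

```python
from collections import OrderedDict

def chunk_prompt(tokens, window=8, stride=4):
    """Step 1: split a token sequence into overlapping chunks."""
    return [tuple(tokens[i:i + window])
            for i in range(0, max(len(tokens) - window + 1, 1), stride)]

class PromptCache:
    """Per-device cache of chunk -> activations with LRU eviction (step 6)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion order doubles as LRU order

    def get(self, chunk):
        if chunk in self.store:
            self.store.move_to_end(chunk)  # mark as recently used
            return self.store[chunk]
        return None

    def put(self, chunk, activations):
        self.store[chunk] = activations
        self.store.move_to_end(chunk)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least-recently-used entry

    def longest_prefix_match(self, chunk):
        """Step 5: fall back to the longest cached prefix of this chunk,
        so only the missing tail needs recomputing."""
        for end in range(len(chunk), 0, -1):
            hit = self.get(chunk[:end])
            if hit is not None:
                return chunk[:end], hit
        return None, None
```

On a real device the stored values would be quantized activation tensors rather than Python objects, but the lookup logic is the same.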

Results & Findings

| Metric | Baseline (single Pi) | Distributed Cache | Improvement |
| --- | --- | --- | --- |
| TTFT (ms) | ~1,200 | ~84 | ~93 % reduction |
| TTLT (ms) | ~4,800 | ~2,400 | ~50 % reduction |
| Network overhead | — | ~0.3 MB per inference (mostly catalog traffic) | Negligible vs. compute savings |
| Cache hit rate (exact) | 0 % | 38 % | — |
| Cache hit rate (partial) | 0 % | 71 % | — |

The experiments used the MMLU (Massive Multitask Language Understanding) suite, which contains a wide variety of question‑answer tasks. Even with the modest 270 M‑parameter Gemma‑3 model, the distributed cache turned a sluggish edge inference into a near‑real‑time experience.

Practical Implications

  • Edge‑AI Deployments: Companies can now run LLM‑powered features (e.g., on‑device assistants, anomaly detection, or code completion) on ultra‑low‑power hardware without offloading to the cloud, preserving privacy and reducing latency.
  • Cost Savings: By reusing computation across devices, the overall energy consumption per inference drops dramatically—critical for battery‑operated IoT fleets.
  • Scalable Collaboration: The Bloom‑filter catalog makes the system tolerant of intermittent connectivity; devices can still benefit from peers even when the network is lossy.
  • Framework Integration: The approach is orthogonal to model architecture, meaning it could be wrapped around existing inference runtimes (e.g., TensorFlow Lite, ONNX Runtime) with minimal code changes.
  • Developer Tooling: A simple API (e.g., cache.get_or_compute(prompt_chunk)) abstracts away the networking, letting developers focus on application logic while the runtime handles distributed caching under the hood.
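A minimal sketch of what such a `get_or_compute` API could look like. The signature, the `peers` structure, and the hash-based keying below are assumptions for illustration, not the runtime's actual interface; the point is that local lookup, catalog-gated peer fetches, and recomputation all hide behind one call:

```python
def get_or_compute(prompt_chunk, local_cache, peers, compute_fn):
    """Hypothetical convenience wrapper around the distributed cache.
    `peers` is a list of (catalog, fetch) pairs: `catalog.might_contain`
    is a Bloom-filter membership test, and `fetch` may still return None
    on a catalog false positive."""
    key = hash(prompt_chunk)  # a content hash in a real system
    if key in local_cache:                   # 1. local cache hit
        return local_cache[key]
    for catalog, fetch in peers:             # 2. distributed lookup
        if catalog.might_contain(key):
            tensor = fetch(key)              # 3. retrieve from the peer
            if tensor is not None:
                local_cache[key] = tensor
                return tensor
    result = compute_fn(prompt_chunk)        # 4. full forward pass as fallback
    local_cache[key] = result
    return result
```

From the application's point of view, repeated calls with a popular chunk hit step 1 and never touch the network or the model.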

Limitations & Future Work

  • Memory Footprint: Even with LRU eviction, storing activation tensors can quickly exhaust the few megabytes of RAM on devices like the Pi Zero 2W; smarter compression or quantization of cached states is needed.
  • Network Dependence: The speed gains assume a reasonably reliable Wi‑Fi/BLE link; in highly congested or high‑latency environments the communication cost could offset compute savings.
  • Security & Privacy: Sharing internal activations across devices raises questions about data leakage; future work should explore encryption or homomorphic techniques for the cached tensors.
  • Generalization to Larger Models: The study focused on a 270 M‑parameter model; scaling the method to 1‑2 B‑parameter models may require hierarchical caching or edge‑cloud hybrid designs.

Bottom line: Distributed prompt caching offers a pragmatic path to bring LLM capabilities to the tiniest edge devices, turning a once‑impractical compute problem into a collaborative, network‑enabled solution. As edge AI continues to grow, techniques that let devices “share work” rather than “reinvent the wheel” will become a cornerstone of low‑power, privacy‑preserving AI deployments.

Authors

  • Hiroki Matsutani
  • Naoki Matsuda
  • Naoto Sugiura

Paper Information

  • arXiv ID: 2602.22812v1
  • Categories: cs.LG, cs.DC
  • Published: February 26, 2026
  • PDF: Download PDF