[Paper] LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices

Published: December 25, 2025 at 09:41 PM EST
4 min read
Source: arXiv - 2512.21835v1

Overview

The paper introduces LIME, a system that lets multiple edge devices collaborate to run huge language models (e.g., LLaMA‑3‑70B) without losing any accuracy. By cleverly splitting the model’s work across devices and adapting to tight memory and bandwidth limits, LIME makes “large‑model” inference feasible on hardware that would normally be far too small.

Key Contributions

  • Lossless collaborative inference: Enables full‑precision LLM execution across several memory‑constrained edge nodes, preserving the model’s original accuracy.
  • Interleaved pipeline parallelism + offloading: A novel scheduling scheme that interleaves computation and communication, keeping every device busy while minimizing data transfer.
  • Fine‑grained offline allocation planner: Determines the optimal placement of model layers on each device before deployment, taking heterogeneous memory/compute capabilities into account.
  • Online memory‑adaptation engine: Dynamically reallocates tensors during runtime to react to bursty request patterns and temporary memory pressure.
  • Real‑world evaluation on heterogeneous NVIDIA Jetson boards: Demonstrates up to 3.7× speedup over the best existing edge‑LLM baselines for the 70‑billion‑parameter LLaMA‑3‑Instruct model.

Methodology

  1. Model Partitioning – LIME first breaks the giant transformer into a sequence of layers. An offline optimizer maps each layer (or group of layers) to a specific Jetson device, respecting each board’s RAM and compute budget (a placement sketch follows this list).
  2. Interleaved Pipeline Parallelism – Instead of the classic “stage‑by‑stage” pipeline, where a device must finish its whole chunk before passing data along, LIME interleaves forward‑pass fragments: while one device sends its output to the next, it immediately starts processing the next input token, overlapping communication with computation (an overlap sketch follows this list).
  3. Dynamic Offloading – Large activation tensors that don’t fit in on‑chip memory are temporarily spilled to a shared high‑speed NVMe cache or to a neighboring device’s memory, then fetched just‑in‑time.
  4. Online Memory Adaptation – A lightweight runtime monitor watches memory usage and request arrival rates. When a burst occurs, LIME can reshuffle pending layers or temporarily duplicate small sub‑modules on idle devices to keep latency low.
  5. Implementation Stack – Built on top of PyTorch and NVIDIA’s TensorRT, with custom CUDA kernels for the interleaved pipeline and a lightweight RPC layer for cross‑device tensor exchange.
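
To make step 1 concrete, below is a minimal sketch of what an offline placement pass could look like, assuming a purely greedy, memory‑only heuristic; the paper’s planner also balances heterogeneous compute and communication, which this sketch omits. The function name, layer sizes, and memory budgets are illustrative, not taken from the released code.

```python
# Illustrative sketch of an offline layer-placement planner (not the paper's
# algorithm): greedily pack contiguous layer groups onto devices so that each
# device's memory budget is respected. Sizes and budgets are made-up numbers.

def plan_placement(layer_sizes_mb, device_budgets_mb):
    """Assign contiguous layers to devices without exceeding memory budgets.

    Returns one (start_layer, end_layer_exclusive) tuple per device; raises if
    the model cannot fit on the given devices.
    """
    placement = []
    layer = 0
    for budget in device_budgets_mb:
        start, used = layer, 0
        # Pack layers onto this device while its budget allows.
        while layer < len(layer_sizes_mb) and used + layer_sizes_mb[layer] <= budget:
            used += layer_sizes_mb[layer]
            layer += 1
        placement.append((start, layer))
    if layer < len(layer_sizes_mb):
        raise RuntimeError("Model does not fit on the available devices")
    return placement


if __name__ == "__main__":
    # Hypothetical: 80 transformer layers of ~900 MB each, four Jetson boards
    # with heterogeneous amounts of free memory (in MB).
    layers = [900] * 80
    budgets = [20_000, 18_000, 20_000, 16_000]
    print(plan_placement(layers, budgets))  # [(0, 22), (22, 42), (42, 64), (64, 80)]
```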

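For step 2, the sketch below illustrates the core overlap on a single pipeline stage, assuming a torch.distributed point‑to‑point setup has already been initialized across the devices. It is a generic illustration of interleaving computation with non‑blocking transfers, not LIME’s custom RPC layer or scheduler.

```python
# Generic illustration of overlapping compute and communication on one
# pipeline stage (not LIME's actual runtime). Assumes torch.distributed is
# initialized and `stage_module` holds this device's share of the layers.
import torch
import torch.distributed as dist

def run_stage(stage_module, micro_batches, next_rank):
    """Run this stage's layers over micro-batches, sending each output with a
    non-blocking isend so the next micro-batch's compute overlaps the transfer."""
    in_flight = None  # (send handle, tensor kept alive until the send completes)
    for x in micro_batches:
        with torch.no_grad():
            y = stage_module(x)              # compute the current micro-batch
        if in_flight is not None:
            in_flight[0].wait()              # the previous send overlapped this compute
        in_flight = (dist.isend(y, dst=next_rank), y)  # start the next transfer
    if in_flight is not None:
        in_flight[0].wait()                  # drain the final transfer
```

The same asynchronous pattern extends naturally to the offloading of step 3: activations that do not fit on‑device can be spilled with a non‑blocking copy to host or NVMe‑backed storage and prefetched just before they are needed again.
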
Results & Findings

| Metric | Baseline (single Jetson) | LIME (4‑device) | Speedup | Accuracy impact |
| --- | --- | --- | --- | --- |
| End‑to‑end latency (average) | 1,200 ms | 710 ms (sporadic) / 320 ms (bursty) | 1.7× / 3.7× | 0 % (identical) |
| Peak memory per device | 12 GB (exceeds) | ≤ 6 GB (fits) | | |
| Network bandwidth used (average) | 2 Gbps (continuous) | 0.6 Gbps (bursty) | | |
  • Lossless inference: No measurable drop in perplexity or downstream task scores compared with running the full model on a server‑grade GPU.
  • Scalability: Adding a fourth heterogeneous Jetson (different CPU/GPU ratios) still yielded net gains, confirming the scheduler’s ability to handle non‑uniform hardware.
  • Robustness to traffic patterns: Under bursty request arrivals (e.g., a sudden spike of 10 prompts), LIME’s online adaptation kept latency low, whereas static pipelines stalled.

Practical Implications

  • Edge AI products (smart cameras, robotics, AR/VR headsets) can now embed state‑of‑the‑art LLMs for on‑device reasoning, reducing reliance on cloud APIs and improving privacy.
  • Cost‑effective deployment: Companies can leverage inexpensive Jetson‑class hardware clusters instead of expensive data‑center GPUs for inference‑heavy workloads.
  • Network‑aware AI services: By keeping bandwidth usage modest, LIME enables real‑time LLM responses even on 5G or congested Wi‑Fi links, opening up new use‑cases like offline assistants or remote field diagnostics.
  • Developer‑friendly stack: The authors release the scheduler and runtime as a Python library, making it straightforward to plug into existing PyTorch pipelines.

Limitations & Future Work

  • Hardware dependence: The current prototype targets NVIDIA Jetson devices; extending to other edge accelerators (e.g., Google Edge TPU, AMD Ryzen AI) will require additional kernel work.
  • Static offline planner: While the runtime can adapt memory on‑the‑fly, the initial layer placement is computed once per model. Rapid model updates would need re‑planning.
  • Security considerations: Inter‑device tensor exchange assumes a trusted local network; future versions should incorporate encryption or secure enclaves for hostile edge environments.
  • Scaling beyond four nodes: The paper shows promising results up to four devices; exploring larger clusters and the associated communication overhead is left for future research.

LIME demonstrates that, with smart scheduling and collaborative pipelines, the notion that massive LLMs are out of reach for edge‑only deployment can finally be put to rest, bringing truly large‑scale language intelligence to the devices that sit at the front line of user interaction.

Authors

  • Mingyu Sun
  • Xiao Zhang
  • Shen Qu
  • Yan Li
  • Mengbai Xiao
  • Yuan Yuan
  • Dongxiao Yu

Paper Information

  • arXiv ID: 2512.21835v1
  • Categories: cs.DC
  • Published: December 26, 2025
  • PDF: Download PDF