[Paper] Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

Published: December 29, 2025 at 03:57 AM EST
4 min read

Source: arXiv - 2512.23310v1

Overview

Deploying today’s massive language models on edge devices (smartphones, IoT boards, autonomous robots) is a pain point: the models don’t fit in memory, and running them locally burns power. Cloud‑only inference solves the memory issue but adds latency and bandwidth costs, and it becomes unreliable when the network is flaky. Splitwise tackles this head‑on with a dynamic, fine‑grained edge‑cloud partitioning strategy that continuously adapts to workload and network conditions, delivering faster, greener inference without sacrificing model quality.

Key Contributions

  • Fine‑grained partitioning – Breaks transformer layers into attention heads and feed‑forward sub‑blocks, expanding the design space far beyond traditional layer‑wise splits.
  • Lyapunov‑assisted DRL controller – A hierarchical deep‑reinforcement‑learning policy, regularized by Lyapunov optimization, jointly minimizes latency, energy, and accuracy loss while guaranteeing queue stability under stochastic request arrivals.
  • Robust checkpoint & recovery – Introduces exponential‑backoff checkpointing to gracefully handle intermittent network failures.
  • Comprehensive evaluation – Real‑world experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 using GPT‑2 (1.5B), LLaMA‑7B, and LLaMA‑13B show up to 2.8× latency reduction and 41 % energy savings versus state‑of‑the‑art partitioners.
  • QoS guarantees – Cuts the 95th‑percentile latency by 53‑61 % compared with pure cloud inference while keeping model accuracy intact.

Methodology

  1. Model Decomposition – Each transformer layer is split into two types of logical sub‑blocks:

    • (a) Multi‑head self‑attention (MHA) heads
    • (b) Feed‑forward network (FFN)

    This yields many more placement options (e.g., some heads on edge, others on cloud); a toy enumeration of this placement space is sketched after the list.

  2. Hierarchical DRL Policy

    • High‑level agent decides how many sub‑blocks to offload based on current queue length, device battery, and network bandwidth.
    • Low‑level agent selects the exact sub‑blocks (which heads, which FFN slices) to place on edge vs. cloud; a minimal two‑level sketch appears after the list.
  3. Lyapunov Optimization – A Lyapunov function measures system “drift” (queue growth). By minimizing a drift‑plus‑penalty term, the controller ensures the request queue stays stable (no unbounded backlog) while optimizing a weighted sum of latency, energy, and accuracy loss (the standard form is written out after the list).

  4. Checkpointing & Recovery – After each inference step, a lightweight checkpoint is streamed to the cloud. If a transmission fails, the system backs off exponentially and retries, preventing total job failure (a backoff sketch follows the list).

  5. Training & Deployment – The DRL agents are trained offline on a simulated workload that mimics real‑world request patterns and bandwidth traces. The learned policy is then embedded as a lightweight runtime library on the edge device.
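
To make the size of the fine‑grained design space from step 1 concrete, here is a minimal toy sketch; the SubBlock type and enumerate_sub_blocks helper are our own illustration, not the paper's code:

```python
from dataclasses import dataclass

# Hypothetical sketch of the fine-grained placement space: every transformer
# layer is decomposed into its attention heads plus one FFN sub-block, and
# each sub-block can independently be placed on the edge device or the cloud.

@dataclass(frozen=True)
class SubBlock:
    layer: int
    kind: str    # "head" or "ffn"
    index: int   # head index, or 0 for the single FFN sub-block

def enumerate_sub_blocks(num_layers: int, num_heads: int) -> list[SubBlock]:
    blocks = []
    for layer in range(num_layers):
        blocks += [SubBlock(layer, "head", h) for h in range(num_heads)]
        blocks.append(SubBlock(layer, "ffn", 0))
    return blocks

# A 2-layer, 4-head toy model: a layer-wise split has only a handful of cut
# points, but the fine-grained space has 2**10 = 1024 edge/cloud assignments.
blocks = enumerate_sub_blocks(num_layers=2, num_heads=4)
print(f"{len(blocks)} sub-blocks, {2 ** len(blocks)} possible placements")
```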
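
Step 2's two-level control can be pictured with the following sketch, where hand-written heuristics stand in for the trained high- and low-level DRL agents; the state features, thresholds, and per-block costs are assumptions for illustration only:

```python
import random

# Hypothetical two-level controller sketch. Hand-written heuristics stand in
# for the trained DRL agents described in the paper.

def high_level_agent(queue_len, battery_pct, bandwidth_mbps, total_blocks):
    """Decide HOW MANY sub-blocks to offload, from coarse system state."""
    pressure = min(queue_len / 10.0, 1.0)        # backlog pushes work off-device
    battery  = battery_pct / 100.0               # low battery also favors offload
    link     = min(bandwidth_mbps / 30.0, 1.0)   # a weak link favors local compute
    frac = min(1.0, 0.5 * pressure + 0.5 * (1.0 - battery)) * link
    return round(frac * total_blocks)

def low_level_agent(blocks, k, cost_fn):
    """Decide WHICH k sub-blocks to offload: here, simply the k costliest."""
    return sorted(blocks, key=cost_fn, reverse=True)[:k]

# Toy usage: 20 sub-block ids with random per-block cost estimates.
blocks = list(range(20))
costs = {b: random.random() for b in blocks}
k = high_level_agent(queue_len=6, battery_pct=40, bandwidth_mbps=25,
                     total_blocks=len(blocks))
offloaded = low_level_agent(blocks, k, cost_fn=costs.get)
print(f"offloading {k}/{len(blocks)} sub-blocks: {sorted(offloaded)}")
```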
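
For step 3, the textbook drift-plus-penalty construction looks as follows; the paper's exact queues, penalty weights, and notation may differ:

```latex
% Q_i(t): backlog of request queue i at slot t
% p(t):   per-slot penalty (weighted latency + energy + accuracy loss)
L\big(\mathbf{Q}(t)\big) = \tfrac{1}{2}\sum_i Q_i(t)^2,
\qquad
\Delta(t) = \mathbb{E}\!\left[ L\big(\mathbf{Q}(t+1)\big) - L\big(\mathbf{Q}(t)\big) \,\middle|\, \mathbf{Q}(t) \right]

% Each slot, choose the partitioning action that minimizes the
% drift-plus-penalty bound; a larger V weights the latency/energy/accuracy
% objective more heavily relative to queue stability.
\min_{\text{action}} \; \Delta(t) + V \,\mathbb{E}\!\left[ p(t) \,\middle|\, \mathbf{Q}(t) \right]
```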
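
Step 4's retry behavior boils down to a standard exponential-backoff loop. The sketch below is illustrative; send_with_backoff and the flaky transport are hypothetical names, not the paper's implementation:

```python
import random
import time

# Hypothetical sketch of exponential-backoff checkpoint streaming; the real
# system's transport, payload format, and retry limits are not spelled out in
# this summary, so everything below is illustrative.

def send_with_backoff(send_checkpoint, payload,
                      max_retries=5, base_delay=0.05, max_delay=2.0):
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return send_checkpoint(payload)        # success: ack from the cloud
        except ConnectionError:
            if attempt == max_retries:
                raise                              # out of retries: surface failure
            # Wait base * 2^attempt (jittered, capped) before the next attempt.
            time.sleep(min(delay, max_delay) * (0.5 + random.random() / 2))
            delay *= 2

# Toy transport that drops ~30 % of transmissions, mimicking a lossy link.
def flaky_send(payload):
    if random.random() < 0.3:
        raise ConnectionError("simulated transmission failure")
    return "ack"

print(send_with_backoff(flaky_send, b"checkpoint bytes"))
```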

Results & Findings

| Platform | Model | Baseline (cloud‑only) | Splitwise | Latency ↓ | Energy ↓ | 95th‑pct Latency ↓ |
|---|---|---|---|---|---|---|
| Jetson Orin NX | LLaMA‑7B | 210 ms | 78 ms | 2.7× | 38 % | 58 % |
| Galaxy S23 | GPT‑2 1.5B | 180 ms | 65 ms | 2.8× | 41 % | 61 % |
| Raspberry Pi 5 | LLaMA‑13B | 420 ms | 150 ms | 2.8× | 35 % | 53 % |

  • Accuracy remained within 0.2 % of the full‑cloud baseline, confirming that the fine‑grained split does not introduce noticeable quantization or approximation errors.
  • The DRL controller reacted to sudden bandwidth drops (e.g., from 30 Mbps to 5 Mbps) by shifting more heads to the edge, keeping tail latency low.
  • Checkpoint recovery added < 5 ms overhead even under a 30 % packet‑loss scenario.

Practical Implications

  • Edge‑first AI products – Mobile apps, AR/VR experiences, and robotics can now run sophisticated LLMs locally without sacrificing responsiveness or draining the battery.
  • Cost‑effective cloud usage – By offloading only the most compute‑heavy sub‑blocks, data‑center load and egress bandwidth bills drop dramatically.
  • Dynamic QoS provisioning – Service providers can embed Splitwise to guarantee latency SLAs even when users roam between Wi‑Fi and cellular networks.
  • Developer‑friendly SDK – The authors release a lightweight C++/Python library that abstracts the DRL policy behind a simple infer() call, making integration into existing pipelines painless (a hypothetical usage sketch follows this list).
  • Security & privacy – Sensitive prompt data can stay on‑device for the attention heads that process user‑specific context, reducing exposure to the cloud.
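
Since this summary only states that the SDK hides the controller behind a single infer() call, the snippet below is a purely hypothetical sketch of what an integration could look like; the class name, constructor arguments, and endpoint are invented for illustration and are not the authors' documented API:

```python
from dataclasses import dataclass

# Purely hypothetical wrapper illustrating an "infer() behind one call" API
# shape; this is NOT the authors' released library.

@dataclass
class SplitwiseRuntime:
    model: str            # which LLM the policy was trained for
    cloud_endpoint: str   # where offloaded sub-blocks would execute
    policy_path: str      # serialized DRL policy shipped with the app

    def infer(self, prompt: str) -> str:
        # A real runtime would consult the learned policy, place sub-blocks on
        # edge vs. cloud, stream checkpoints, and return the generated text.
        # Here we only stub the call shape.
        return f"[{self.model}] completion for: {prompt!r}"

runtime = SplitwiseRuntime(model="llama-7b",
                           cloud_endpoint="https://cloud.example.com/splitwise",
                           policy_path="policy.bin")
print(runtime.infer("Summarize today's sensor log."))
```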

Limitations & Future Work

  • Training overhead – The DRL policy requires a simulated environment and several hours of training for each new model‑hardware combo, which may be a barrier for rapid prototyping.
  • Model size ceiling – Experiments stopped at 13B parameters; scaling to 70B‑class models may need additional hierarchical splitting (e.g., across multiple cloud nodes).
  • Network assumptions – The current design assumes a relatively stable TCP connection; bursty UDP‑based streaming or satellite links need separate robustness mechanisms.

Future directions proposed by the authors:

  1. Meta‑learning to transfer policies across models.
  2. Extending the framework to multi‑edge scenarios (e.g., edge‑to‑edge collaboration).
  3. Incorporating quantization‑aware splitting to push the memory envelope even further.

Authors

  • Abolfazl Younesi
  • Abbas Shabrang Maryan
  • Elyas Oustad
  • Zahra Najafabadi Samani
  • Mohsen Ansari
  • Thomas Fahringer

Paper Information

  • arXiv ID: 2512.23310v1
  • Categories: cs.LG, cs.AI, cs.DC, cs.ET, cs.NI
  • Published: December 29, 2025