[Paper] Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

Published: December 29, 2025 at 03:57 AM EST
4 min read

Source: arXiv - 2512.23310v1

Overview

Deploying today’s massive language models on edge devices (smartphones, IoT boards, autonomous robots) is a pain point: the models don’t fit in memory, and running them locally burns power. Cloud‑only inference solves the memory issue but adds latency and bandwidth costs, and it becomes unreliable when the network is flaky. Splitwise tackles this head‑on with a dynamic, fine‑grained edge‑cloud partitioning strategy that continuously adapts to workload and network conditions, delivering faster, greener inference without sacrificing model quality.

Key Contributions

  • Fine‑grained partitioning – Breaks transformer layers into attention heads and feed‑forward sub‑blocks, expanding the design space far beyond traditional layer‑wise splits.
  • Lyapunov‑assisted DRL controller – A hierarchical deep‑reinforcement‑learning policy, regularized by Lyapunov optimization, jointly minimizes latency, energy, and accuracy loss while guaranteeing queue stability under stochastic request arrivals.
  • Robust checkpoint & recovery – Introduces exponential‑backoff checkpointing to gracefully handle intermittent network failures.
  • Comprehensive evaluation – Real‑world experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 using GPT‑2 (1.5B), LLaMA‑7B, and LLaMA‑13B show up to 2.8× latency reduction and 41 % energy savings versus state‑of‑the‑art partitioners.
  • QoS guarantees – Cuts the 95th‑percentile latency by 53‑61 % compared with pure cloud inference while keeping model accuracy intact.

Methodology

  1. Model Decomposition – Each transformer layer is split into two types of logical sub‑blocks:

    • (a) Multi‑head self‑attention (MHA) heads
    • (b) Feed‑forward network (FFN)

    This yields many more placement options (e.g., some heads on edge, others on cloud); a toy enumeration of this placement space is sketched after the list.

  2. Hierarchical DRL Policy

    • High‑level agent decides how many sub‑blocks to offload based on current queue length, device battery, and network bandwidth.
    • Low‑level agent selects the exact sub‑blocks (which heads, which FFN slices) to place on edge vs. cloud; a minimal two‑level sketch appears after the list.
  3. Lyapunov Optimization – A Lyapunov function measures system “drift” (queue growth). By minimizing a drift‑plus‑penalty term, the controller ensures the request queue stays stable (no unbounded backlog) while optimizing a weighted sum of latency, energy, and accuracy loss (the standard form is written out after the list).

  4. Checkpointing & Recovery – After each inference step, a lightweight checkpoint is streamed to the cloud. If a transmission fails, the system backs off exponentially and retries, preventing total job failure (a backoff sketch follows the list).

  5. Training & Deployment – The DRL agents are trained offline on a simulated workload that mimics real‑world request patterns and bandwidth traces. The learned policy is then embedded as a lightweight runtime library on the edge device.
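
To make the size of the fine‑grained design space from step 1 concrete, here is a minimal toy sketch; the SubBlock type and enumerate_sub_blocks helper are our own illustration, not the paper's code:

```python
from dataclasses import dataclass

# Hypothetical sketch of the fine-grained placement space: every transformer
# layer is decomposed into its attention heads plus one FFN sub-block, and
# each sub-block can independently be placed on the edge device or the cloud.

@dataclass(frozen=True)
class SubBlock:
    layer: int
    kind: str    # "head" or "ffn"
    index: int   # head index, or 0 for the single FFN sub-block

def enumerate_sub_blocks(num_layers: int, num_heads: int) -> list[SubBlock]:
    blocks = []
    for layer in range(num_layers):
        blocks += [SubBlock(layer, "head", h) for h in range(num_heads)]
        blocks.append(SubBlock(layer, "ffn", 0))
    return blocks

# A 2-layer, 4-head toy model: a layer-wise split has only a handful of cut
# points, but the fine-grained space has 2**10 = 1024 edge/cloud assignments.
blocks = enumerate_sub_blocks(num_layers=2, num_heads=4)
print(f"{len(blocks)} sub-blocks, {2 ** len(blocks)} possible placements")
```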
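
Step 2's two-level control can be pictured with the following sketch, where hand-written heuristics stand in for the trained high- and low-level DRL agents; the state features, thresholds, and per-block costs are assumptions for illustration only:

```python
import random

# Hypothetical two-level controller sketch. Hand-written heuristics stand in
# for the trained DRL agents described in the paper.

def high_level_agent(queue_len, battery_pct, bandwidth_mbps, total_blocks):
    """Decide HOW MANY sub-blocks to offload, from coarse system state."""
    pressure = min(queue_len / 10.0, 1.0)        # backlog pushes work off-device
    battery  = battery_pct / 100.0               # low battery also favors offload
    link     = min(bandwidth_mbps / 30.0, 1.0)   # a weak link favors local compute
    frac = min(1.0, 0.5 * pressure + 0.5 * (1.0 - battery)) * link
    return round(frac * total_blocks)

def low_level_agent(blocks, k, cost_fn):
    """Decide WHICH k sub-blocks to offload: here, simply the k costliest."""
    return sorted(blocks, key=cost_fn, reverse=True)[:k]

# Toy usage: 20 sub-block ids with random per-block cost estimates.
blocks = list(range(20))
costs = {b: random.random() for b in blocks}
k = high_level_agent(queue_len=6, battery_pct=40, bandwidth_mbps=25,
                     total_blocks=len(blocks))
offloaded = low_level_agent(blocks, k, cost_fn=costs.get)
print(f"offloading {k}/{len(blocks)} sub-blocks: {sorted(offloaded)}")
```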
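
For step 3, the textbook drift-plus-penalty construction looks as follows; the paper's exact queues, penalty weights, and notation may differ:

```latex
% Q_i(t): backlog of request queue i at slot t
% p(t):   per-slot penalty (weighted latency + energy + accuracy loss)
L\big(\mathbf{Q}(t)\big) = \tfrac{1}{2}\sum_i Q_i(t)^2,
\qquad
\Delta(t) = \mathbb{E}\!\left[ L\big(\mathbf{Q}(t+1)\big) - L\big(\mathbf{Q}(t)\big) \,\middle|\, \mathbf{Q}(t) \right]

% Each slot, choose the partitioning action that minimizes the
% drift-plus-penalty bound; a larger V weights the latency/energy/accuracy
% objective more heavily relative to queue stability.
\min_{\text{action}} \; \Delta(t) + V \,\mathbb{E}\!\left[ p(t) \,\middle|\, \mathbf{Q}(t) \right]
```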
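
Step 4's retry behavior boils down to a standard exponential-backoff loop. The sketch below is illustrative; send_with_backoff and the flaky transport are hypothetical names, not the paper's implementation:

```python
import random
import time

# Hypothetical sketch of exponential-backoff checkpoint streaming; the real
# system's transport, payload format, and retry limits are not spelled out in
# this summary, so everything below is illustrative.

def send_with_backoff(send_checkpoint, payload,
                      max_retries=5, base_delay=0.05, max_delay=2.0):
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return send_checkpoint(payload)        # success: ack from the cloud
        except ConnectionError:
            if attempt == max_retries:
                raise                              # out of retries: surface failure
            # Wait base * 2^attempt (jittered, capped) before the next attempt.
            time.sleep(min(delay, max_delay) * (0.5 + random.random() / 2))
            delay *= 2

# Toy transport that drops ~30 % of transmissions, mimicking a lossy link.
def flaky_send(payload):
    if random.random() < 0.3:
        raise ConnectionError("simulated transmission failure")
    return "ack"

print(send_with_backoff(flaky_send, b"checkpoint bytes"))
```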

Results & Findings

| Platform | Model | Baseline (cloud‑only) | Splitwise | Latency ↓ | Energy ↓ | 95th‑pct Latency ↓ |
|---|---|---|---|---|---|---|
| Jetson Orin NX | LLaMA‑7B | 210 ms | 78 ms | 2.7× | 38 % | 58 % |
| Galaxy S23 | GPT‑2 1.5B | 180 ms | 65 ms | 2.8× | 41 % | 61 % |
| Raspberry Pi 5 | LLaMA‑13B | 420 ms | 150 ms | 2.8× | 35 % | 53 % |

  • Accuracy remained within 0.2 % of the full‑cloud baseline, confirming that the fine‑grained split does not introduce noticeable quantization or approximation errors.
  • The DRL controller reacted to sudden bandwidth drops (e.g., from 30 Mbps to 5 Mbps) by shifting more heads to the edge, keeping tail latency low.
  • Checkpoint recovery added < 5 ms overhead even under a 30 % packet‑loss scenario.

Practical Implications

  • Edge‑first AI products – Mobile apps, AR/VR experiences, and robotics can now run sophisticated LLMs locally without sacrificing responsiveness or draining the battery.
  • Cost‑effective cloud usage – By offloading only the most compute‑heavy sub‑blocks, data‑center load and egress bandwidth bills drop dramatically.
  • Dynamic QoS provisioning – Service providers can embed Splitwise to guarantee latency SLAs even when users roam between Wi‑Fi and cellular networks.
  • Developer‑friendly SDK – The authors release a lightweight C++/Python library that abstracts the DRL policy behind a simple infer() call, making integration into existing pipelines painless (a hypothetical usage sketch follows this list).
  • Security & privacy – Sensitive prompt data can stay on‑device for the attention heads that process user‑specific context, reducing exposure to the cloud.
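
Since this summary only states that the SDK hides the controller behind a single infer() call, the snippet below is a purely hypothetical sketch of what an integration could look like; the class name, constructor arguments, and endpoint are invented for illustration and are not the authors' documented API:

```python
from dataclasses import dataclass

# Purely hypothetical wrapper illustrating an "infer() behind one call" API
# shape; this is NOT the authors' released library.

@dataclass
class SplitwiseRuntime:
    model: str            # which LLM the policy was trained for
    cloud_endpoint: str   # where offloaded sub-blocks would execute
    policy_path: str      # serialized DRL policy shipped with the app

    def infer(self, prompt: str) -> str:
        # A real runtime would consult the learned policy, place sub-blocks on
        # edge vs. cloud, stream checkpoints, and return the generated text.
        # Here we only stub the call shape.
        return f"[{self.model}] completion for: {prompt!r}"

runtime = SplitwiseRuntime(model="llama-7b",
                           cloud_endpoint="https://cloud.example.com/splitwise",
                           policy_path="policy.bin")
print(runtime.infer("Summarize today's sensor log."))
```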

Limitations & Future Work

  • Training overhead – The DRL policy requires a simulated environment and several hours of training for each new model‑hardware combo, which may be a barrier for rapid prototyping.
  • Model size ceiling – Experiments stopped at 13B parameters; scaling to 70B‑class models may need additional hierarchical splitting (e.g., across multiple cloud nodes).
  • Network assumptions – The current design assumes a relatively stable TCP connection; bursty UDP‑based streaming or satellite links need separate robustness mechanisms.

Future directions proposed by the authors:

  1. Meta‑learning to transfer policies across models.
  2. Extending the framework to multi‑edge scenarios (e.g., edge‑to‑edge collaboration).
  3. Incorporating quantization‑aware splitting to push the memory envelope even further.

Authors

  • Abolfazl Younesi
  • Abbas Shabrang Maryan
  • Elyas Oustad
  • Zahra Najafabadi Samani
  • Mohsen Ansari
  • Thomas Fahringer

Paper Information

  • arXiv ID: 2512.23310v1
  • Categories: cs.LG, cs.AI, cs.DC, cs.ET, cs.NI
  • Published: December 29, 2025