[Paper] Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models
Source: arXiv - 2512.21884v1
Overview
Large language models (LLMs) deliver impressive AI capabilities, but running inference on them remains costly because they need powerful GPUs. The PETALS system showed that you can split an LLM across many low‑end GPUs spread over the Internet, but the speed you get hinges on where each model block lives and how inference requests are routed. This paper presents the first systematic study of that resource‑allocation problem, offering provably good algorithms and a lightweight simulator that lets developers experiment without a GPU farm.
Key Contributions
- Performance models that accurately predict inference latency for any block‑placement + routing configuration, validated on real PETALS deployments.
- Formal problem formulation: block placement + request routing cast as a mixed‑integer linear program (MILP) and proven NP‑hard.
- Polynomial‑time algorithm with a guaranteed approximation ratio for the offline (static) allocation problem.
- Online adaptation that reacts to incoming request streams while preserving the same performance bound under bounded load.
- CPU‑only simulator that mimics distributed LLM inference on GPU servers, enabling large‑scale “what‑if” studies without expensive hardware.
Methodology
- System Modeling – The authors break down an LLM inference pipeline into blocks (e.g., transformer layers) that can be placed on any server. They capture two latency sources: (a) computation latency (depends on the server’s GPU speed) and (b) communication latency (network round‑trip time between servers).
- Empirical Calibration – By running micro‑benchmarks on a handful of heterogeneous machines, they fit simple linear models that map block size and network bandwidth to latency. The models are then cross‑validated on unseen placements to ensure reliability (a minimal latency‑model sketch appears after this list).
- Optimization Formulation – The placement‑routing decision is expressed as a MILP: binary variables indicate whether a block resides on a server, and flow variables encode how a request traverses the blocks. The objective minimizes the worst‑case (or average) inference time (an illustrative formulation is sketched after this list).
- Algorithm Design – Because solving the MILP exactly is intractable for realistic clusters, they develop a greedy‑plus‑local‑search heuristic that runs in polynomial time and provably stays within a constant factor of the optimal solution (see the heuristic sketch after this list).
- Online Extension – The offline solution is turned into an online scheduler by periodically re‑optimizing with the current load snapshot; a theoretical analysis shows the same approximation guarantee holds as long as the load does not spike beyond a known bound (see the control‑loop sketch after this list).
- Simulation Platform – A lightweight CPU‑only simulator implements the calibrated performance models, allowing the authors to evaluate thousands of placement scenarios quickly and to compare against the state‑of‑the‑art PETALS scheduler.
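
A minimal sketch of the kind of calibrated latency model described above, assuming a linear compute‑latency fit and an RTT‑plus‑bandwidth link model; the benchmark numbers, function names, and functional forms are hypothetical illustrations, not the authors' exact models. The end‑to‑end predictor is also essentially what the CPU‑only simulator would replay.

```python
import numpy as np

# Hypothetical micro-benchmark samples: columns are
#   [block_gflops, 1 / relative_server_speed, measured_compute_latency_ms]
samples = np.array([
    [12.0, 1.00, 18.5],
    [12.0, 0.55, 10.9],
    [24.0, 1.00, 36.2],
    [24.0, 0.55, 20.4],
])

# Fit compute_latency ~= a * (gflops / speed) + b by ordinary least squares.
X = np.column_stack([samples[:, 0] * samples[:, 1], np.ones(len(samples))])
(a, b), *_ = np.linalg.lstsq(X, samples[:, 2], rcond=None)

def compute_latency_ms(gflops, inv_speed):
    """Per-block computation latency on a server with the given inverse speed."""
    return a * gflops * inv_speed + b

def comm_latency_ms(payload_bytes, bandwidth_bps, rtt_ms):
    """Per-hop communication latency: one-way propagation plus serialization."""
    return rtt_ms / 2.0 + 1e3 * payload_bytes * 8 / bandwidth_bps

def path_latency_ms(block_gflops, placement, inv_speed, links, payload_bytes):
    """End-to-end latency of one request visiting blocks 0..B-1 under `placement`.

    placement[b]  -> server hosting block b
    inv_speed[s]  -> inverse relative speed of server s
    links[(s, t)] -> (bandwidth_bps, rtt_ms) of the link from server s to t
    """
    total = 0.0
    for b, gflops in enumerate(block_gflops):
        s = placement[b]
        total += compute_latency_ms(gflops, inv_speed[s])
        if b + 1 < len(block_gflops) and placement[b + 1] != s:
            bw, rtt = links[(s, placement[b + 1])]
            total += comm_latency_ms(payload_bytes, bw, rtt)
    return total
```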
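
The paper's exact MILP is not reproduced in this summary; the LaTeX sketch below writes down one plausible single‑path formulation, where the binary $x_{b,s}$ places block $b$ on server $s$, $c_{b,s}$ and $d_{s,t}$ are the calibrated compute and per‑hop communication latencies, and $y_{b,s,t}$ linearizes consecutive‑block hops. The memory‑capacity constraint and the worst‑case objective are illustrative assumptions.

```latex
% Illustrative block-placement MILP for a single request path (not the paper's exact program).
% x_{b,s} = 1 iff block b is hosted on server s; y_{b,s,t} linearizes consecutive-block hops;
% c_{b,s}: compute latency of block b on server s; d_{s,t}: per-hop communication latency (d_{s,s} = 0);
% m_b: memory of block b; M_s: capacity of server s; T: end-to-end latency bound to minimize.
\begin{align}
\min_{x,\,y,\,T}\quad & T \\
\text{s.t.}\quad
  & \sum_{s \in S} x_{b,s} = 1, && \forall b \in B, \\
  & \sum_{b \in B} m_b\, x_{b,s} \le M_s, && \forall s \in S, \\
  & y_{b,s,t} \ge x_{b,s} + x_{b+1,t} - 1, && \forall b < |B|,\; s,t \in S, \\
  & \sum_{b,\,s} c_{b,s}\, x_{b,s} \;+\; \sum_{b,\,s,\,t} d_{s,t}\, y_{b,s,t} \le T, \\
  & x_{b,s} \in \{0,1\}, \quad y_{b,s,t} \in [0,1], \quad T \ge 0.
\end{align}
```

Because $d_{s,t} \ge 0$ and $T$ is minimized, each $y_{b,s,t}$ settles at $\max(0,\, x_{b,s} + x_{b+1,t} - 1)$, so with $d_{s,s} = 0$ the latency constraint charges communication only when consecutive blocks sit on different servers.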
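
The approximation algorithm itself is only summarized above; the Python sketch below shows a generic greedy‑plus‑local‑search procedure of the same flavor (not the authors' exact algorithm, and without their approximation analysis), reusing the hypothetical `path_latency_ms` predictor from the first sketch.

```python
import itertools

def greedy_place(block_gflops, servers, inv_speed, links, mem_need, mem_cap, payload):
    """Host each block, in pipeline order, on the feasible server that minimizes
    the partial end-to-end latency so far (assumes some feasible server exists)."""
    placement, used = [], {s: 0.0 for s in servers}
    for b in range(len(block_gflops)):
        best_s, best_cost = None, float("inf")
        for s in servers:
            if used[s] + mem_need[b] > mem_cap[s]:
                continue
            cost = path_latency_ms(block_gflops[:b + 1], placement + [s],
                                   inv_speed, links, payload)
            if cost < best_cost:
                best_s, best_cost = s, cost
        placement.append(best_s)
        used[best_s] += mem_need[b]
    return placement

def local_search(placement, block_gflops, servers, inv_speed, links,
                 mem_need, mem_cap, payload, max_rounds=10):
    """Repeatedly apply single-block moves that shrink end-to-end latency."""
    def feasible(p):
        load = {s: 0.0 for s in servers}
        for b, s in enumerate(p):
            load[s] += mem_need[b]
        return all(load[s] <= mem_cap[s] for s in servers)

    best = path_latency_ms(block_gflops, placement, inv_speed, links, payload)
    for _ in range(max_rounds):
        improved = False
        for b, s in itertools.product(range(len(placement)), servers):
            candidate = placement[:]
            candidate[b] = s
            cost = path_latency_ms(block_gflops, candidate, inv_speed, links, payload)
            if feasible(candidate) and cost < best:
                placement, best, improved = candidate, cost, True
        if not improved:
            break
    return placement, best
```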
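
For the online extension, the summary only states that the scheduler periodically re‑optimizes against a load snapshot. A minimal control‑loop sketch with hypothetical monitoring and migration hooks (`get_load_snapshot`, `reoptimize`, `apply_placement`) is shown below; deferring migrations when load exceeds the known bound is an illustrative policy choice, not necessarily the paper's mechanism.

```python
import time

def online_scheduler(get_load_snapshot, reoptimize, apply_placement,
                     period_s=30.0, load_bound=0.8):
    """Periodically re-run the offline heuristic on a fresh load snapshot.

    get_load_snapshot() -> dict of per-server load (hypothetical monitoring hook)
    reoptimize(snapshot) -> (placement, predicted_latency) from the offline heuristic
    apply_placement(placement) -> migrate blocks / update routing tables
    The approximation guarantee is stated to hold only while load stays within a
    known bound, so this sketch simply skips migrations outside that regime.
    """
    current = None
    while True:
        snapshot = get_load_snapshot()
        if max(snapshot.values()) <= load_bound:
            placement, _predicted = reoptimize(snapshot)
            if placement != current:
                apply_placement(placement)
                current = placement
        time.sleep(period_s)
```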
Results & Findings
| Metric | Baseline (PETALS default) | Proposed Offline Algorithm | Proposed Online Algorithm |
|---|---|---|---|
| 95th‑percentile latency (ms) | 420 | 268 (≈ 36 % reduction) | 285 (≈ 32 % reduction) |
| Average throughput (req/s) | 12 | 18 (≈ 50 % boost) | 17 |
| Scheduler runtime (s) | – | 3.2 (for 50‑node cluster) | 0.9 (per re‑schedule) |
| Simulation error vs. real run | ±12 % | ±4 % (validated) | – |
Key takeaways
- The calibrated models predict latency within ±5 % across diverse geographic setups.
- Even a modest‑size cluster (≈ 30 low‑end GPUs) sees 30‑40 % latency cuts when using the optimized placement.
- The online scheduler reacts to workload changes within seconds and, under bounded load, keeps the same performance guarantee as the offline solution, showing that static planning is not a hard requirement.
Practical Implications
- Cost‑effective LLM serving – Companies can spin up a “GPU‑pool” of inexpensive machines (e.g., consumer‑grade RTX 3060 cards) across data‑center regions and still achieve near‑optimal latency, reducing cloud GPU spend by up to 40 %.
- Edge‑aware AI – Developers building latency‑sensitive applications (e.g., real‑time code assistants, chatbots) can place the most compute‑heavy blocks closer to users, while routing lighter blocks to cheaper back‑ends, balancing speed and cost.
- Simplified DevOps – The open‑source simulator lets teams evaluate “what‑if” scenarios (adding a new node, changing bandwidth) without provisioning hardware, accelerating capacity planning.
- Framework integration – The algorithms are lightweight enough to be embedded into existing model‑parallel runtimes (e.g., DeepSpeed, Megatron‑LM) as a plug‑in scheduler, offering immediate performance gains.
Limitations & Future Work
- Static network assumptions – The models treat network latency/bandwidth as fixed per link; real‑world congestion could violate this, requiring adaptive measurement.
- Homogeneous block granularity – The study assumes each transformer layer is a block; more fine‑grained partitioning (e.g., sub‑layer sharding) may unlock further gains but complicates the optimization.
- Scalability to massive clusters – While the polynomial algorithm scales to a few dozen nodes, handling hundreds of heterogeneous servers may need additional hierarchical or distributed heuristics.
- Security & privacy – Distributing model blocks across public networks raises concerns about model leakage; future work could explore encrypted inference or secure multi‑party computation in this context.
Overall, this work provides a concrete, mathematically grounded toolkit for anyone looking to serve large language models at scale without breaking the bank.
Authors
- Tingyang Sun
- Ting He
- Bo Ji
- Parimal Parag
Paper Information
- arXiv ID: 2512.21884v1
- Categories: cs.DC, cs.AI, cs.NI
- Published: December 26, 2025