[Paper] Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models
Source: arXiv - 2512.21884v1
Overview
Large language models (LLMs) deliver impressive AI capabilities, but running inference on them remains costly because they need powerful GPUs. The PETALS system showed that you can split an LLM across many low‑end GPUs spread over the Internet, but the speed you get hinges on where each model block lives and how inference requests are routed. This paper presents the first systematic study of that resource‑allocation problem, offering provably good algorithms and a lightweight simulator that lets developers experiment without a GPU farm.
Key Contributions
- Performance models that accurately predict inference latency for any block‑placement + routing configuration, validated on real PETALS deployments.
- Formal problem formulation: block placement + request routing cast as a mixed‑integer linear program (MILP) and proven NP‑hard.
- Polynomial‑time algorithm with a guaranteed approximation ratio for the offline (static) allocation problem.
- Online adaptation that reacts to incoming request streams while preserving the same performance bound under bounded load.
- CPU‑only simulator that mimics distributed LLM inference on GPU servers, enabling large‑scale “what‑if” studies without expensive hardware.
Methodology
- System Modeling – The authors break down an LLM inference pipeline into blocks (e.g., transformer layers) that can be placed on any server. They capture two latency sources: (a) computation latency (depends on the server’s GPU speed) and (b) communication latency (network round‑trip time between servers).
- Empirical Calibration – By running micro‑benchmarks on a handful of heterogeneous machines, they fit simple linear models that map block size and network bandwidth to latency. The models are then cross‑validated on unseen placements to ensure reliability (a minimal latency‑model sketch appears after this list).
- Optimization Formulation – The placement‑routing decision is expressed as a MILP: binary variables indicate whether a block resides on a server, and flow variables encode how a request traverses the blocks. The objective minimizes the worst‑case (or average) inference time (an illustrative formulation is sketched after this list).
- Algorithm Design – Because solving the MILP exactly is intractable for realistic clusters, they develop a greedy‑plus‑local‑search heuristic that runs in polynomial time and provably stays within a constant factor of the optimal solution (see the heuristic sketch after this list).
- Online Extension – The offline solution is turned into an online scheduler by periodically re‑optimizing with the current load snapshot; a theoretical analysis shows the same approximation guarantee holds as long as the load does not spike beyond a known bound (see the control‑loop sketch after this list).
- Simulation Platform – A lightweight CPU‑only simulator implements the calibrated performance models, allowing the authors to evaluate thousands of placement scenarios quickly and to compare against the state‑of‑the‑art PETALS scheduler.
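
A minimal sketch of the kind of calibrated latency model described above, assuming a linear compute‑latency fit and an RTT‑plus‑bandwidth link model; the benchmark numbers, function names, and functional forms are hypothetical illustrations, not the authors' exact models. The end‑to‑end predictor is also essentially what the CPU‑only simulator would replay.

```python
import numpy as np

# Hypothetical micro-benchmark samples: columns are
#   [block_gflops, 1 / relative_server_speed, measured_compute_latency_ms]
samples = np.array([
    [12.0, 1.00, 18.5],
    [12.0, 0.55, 10.9],
    [24.0, 1.00, 36.2],
    [24.0, 0.55, 20.4],
])

# Fit compute_latency ~= a * (gflops / speed) + b by ordinary least squares.
X = np.column_stack([samples[:, 0] * samples[:, 1], np.ones(len(samples))])
(a, b), *_ = np.linalg.lstsq(X, samples[:, 2], rcond=None)

def compute_latency_ms(gflops, inv_speed):
    """Per-block computation latency on a server with the given inverse speed."""
    return a * gflops * inv_speed + b

def comm_latency_ms(payload_bytes, bandwidth_bps, rtt_ms):
    """Per-hop communication latency: one-way propagation plus serialization."""
    return rtt_ms / 2.0 + 1e3 * payload_bytes * 8 / bandwidth_bps

def path_latency_ms(block_gflops, placement, inv_speed, links, payload_bytes):
    """End-to-end latency of one request visiting blocks 0..B-1 under `placement`.

    placement[b]  -> server hosting block b
    inv_speed[s]  -> inverse relative speed of server s
    links[(s, t)] -> (bandwidth_bps, rtt_ms) of the link from server s to t
    """
    total = 0.0
    for b, gflops in enumerate(block_gflops):
        s = placement[b]
        total += compute_latency_ms(gflops, inv_speed[s])
        if b + 1 < len(block_gflops) and placement[b + 1] != s:
            bw, rtt = links[(s, placement[b + 1])]
            total += comm_latency_ms(payload_bytes, bw, rtt)
    return total
```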
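
The paper's exact MILP is not reproduced in this summary; the LaTeX sketch below writes down one plausible single‑path formulation, where the binary $x_{b,s}$ places block $b$ on server $s$, $c_{b,s}$ and $d_{s,t}$ are the calibrated compute and per‑hop communication latencies, and $y_{b,s,t}$ linearizes consecutive‑block hops. The memory‑capacity constraint and the worst‑case objective are illustrative assumptions.

```latex
% Illustrative block-placement MILP for a single request path (not the paper's exact program).
% x_{b,s} = 1 iff block b is hosted on server s; y_{b,s,t} linearizes consecutive-block hops;
% c_{b,s}: compute latency of block b on server s; d_{s,t}: per-hop communication latency (d_{s,s} = 0);
% m_b: memory of block b; M_s: capacity of server s; T: end-to-end latency bound to minimize.
\begin{align}
\min_{x,\,y,\,T}\quad & T \\
\text{s.t.}\quad
  & \sum_{s \in S} x_{b,s} = 1, && \forall b \in B, \\
  & \sum_{b \in B} m_b\, x_{b,s} \le M_s, && \forall s \in S, \\
  & y_{b,s,t} \ge x_{b,s} + x_{b+1,t} - 1, && \forall b < |B|,\; s,t \in S, \\
  & \sum_{b,\,s} c_{b,s}\, x_{b,s} \;+\; \sum_{b,\,s,\,t} d_{s,t}\, y_{b,s,t} \le T, \\
  & x_{b,s} \in \{0,1\}, \quad y_{b,s,t} \in [0,1], \quad T \ge 0.
\end{align}
```

Because $d_{s,t} \ge 0$ and $T$ is minimized, each $y_{b,s,t}$ settles at $\max(0,\, x_{b,s} + x_{b+1,t} - 1)$, so with $d_{s,s} = 0$ the latency constraint charges communication only when consecutive blocks sit on different servers.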
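
The approximation algorithm itself is only summarized above; the Python sketch below shows a generic greedy‑plus‑local‑search procedure of the same flavor (not the authors' exact algorithm, and without their approximation analysis), reusing the hypothetical `path_latency_ms` predictor from the first sketch.

```python
import itertools

def greedy_place(block_gflops, servers, inv_speed, links, mem_need, mem_cap, payload):
    """Host each block, in pipeline order, on the feasible server that minimizes
    the partial end-to-end latency so far (assumes some feasible server exists)."""
    placement, used = [], {s: 0.0 for s in servers}
    for b in range(len(block_gflops)):
        best_s, best_cost = None, float("inf")
        for s in servers:
            if used[s] + mem_need[b] > mem_cap[s]:
                continue
            cost = path_latency_ms(block_gflops[:b + 1], placement + [s],
                                   inv_speed, links, payload)
            if cost < best_cost:
                best_s, best_cost = s, cost
        placement.append(best_s)
        used[best_s] += mem_need[b]
    return placement

def local_search(placement, block_gflops, servers, inv_speed, links,
                 mem_need, mem_cap, payload, max_rounds=10):
    """Repeatedly apply single-block moves that shrink end-to-end latency."""
    def feasible(p):
        load = {s: 0.0 for s in servers}
        for b, s in enumerate(p):
            load[s] += mem_need[b]
        return all(load[s] <= mem_cap[s] for s in servers)

    best = path_latency_ms(block_gflops, placement, inv_speed, links, payload)
    for _ in range(max_rounds):
        improved = False
        for b, s in itertools.product(range(len(placement)), servers):
            candidate = placement[:]
            candidate[b] = s
            cost = path_latency_ms(block_gflops, candidate, inv_speed, links, payload)
            if feasible(candidate) and cost < best:
                placement, best, improved = candidate, cost, True
        if not improved:
            break
    return placement, best
```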
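
For the online extension, the summary only states that the scheduler periodically re‑optimizes against a load snapshot. A minimal control‑loop sketch with hypothetical monitoring and migration hooks (`get_load_snapshot`, `reoptimize`, `apply_placement`) is shown below; deferring migrations when load exceeds the known bound is an illustrative policy choice, not necessarily the paper's mechanism.

```python
import time

def online_scheduler(get_load_snapshot, reoptimize, apply_placement,
                     period_s=30.0, load_bound=0.8):
    """Periodically re-run the offline heuristic on a fresh load snapshot.

    get_load_snapshot() -> dict of per-server load (hypothetical monitoring hook)
    reoptimize(snapshot) -> (placement, predicted_latency) from the offline heuristic
    apply_placement(placement) -> migrate blocks / update routing tables
    The approximation guarantee is stated to hold only while load stays within a
    known bound, so this sketch simply skips migrations outside that regime.
    """
    current = None
    while True:
        snapshot = get_load_snapshot()
        if max(snapshot.values()) <= load_bound:
            placement, _predicted = reoptimize(snapshot)
            if placement != current:
                apply_placement(placement)
                current = placement
        time.sleep(period_s)
```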
Results & Findings
| Metric | Baseline (PETALS default) | Proposed Offline Algorithm | Proposed Online Algorithm |
|---|---|---|---|
| 95th‑percentile latency (ms) | 420 | 268 (≈ 36 % reduction) | 285 (≈ 32 % reduction) |
| Average throughput (req/s) | 12 | 18 (≈ 50 % boost) | 17 |
| Scheduler runtime (s) | – | 3.2 (for 50‑node cluster) | 0.9 (per re‑schedule) |
| Simulation error vs. real run | ±12 % | ±4 % (validated) | – |
Key takeaways
- The calibrated models predict latency within ±5 % across diverse geographic setups.
- Even a modest‑size cluster (≈ 30 low‑end GPUs) sees 30‑40 % latency cuts when using the optimized placement.
- The online scheduler reacts to workload changes within seconds and, under bounded load, keeps the same performance guarantee as the offline solution, showing that static planning is not a hard requirement.
Practical Implications
- Cost‑effective LLM serving – Companies can spin up a “GPU‑pool” of inexpensive machines (e.g., consumer‑grade RTX 3060 cards) across data‑center regions and still achieve near‑optimal latency, reducing cloud GPU spend by up to 40 %.
- Edge‑aware AI – Developers building latency‑sensitive applications (e.g., real‑time code assistants, chatbots) can place the most compute‑heavy blocks closer to users, while routing lighter blocks to cheaper back‑ends, balancing speed and cost.
- Simplified DevOps – The open‑source simulator lets teams evaluate “what‑if” scenarios (adding a new node, changing bandwidth) without provisioning hardware, accelerating capacity planning.
- Framework integration – The algorithms are lightweight enough to be embedded into existing model‑parallel runtimes (e.g., DeepSpeed, Megatron‑LM) as a plug‑in scheduler, offering immediate performance gains.
Limitations & Future Work
- Static network assumptions – The models treat network latency/bandwidth as fixed per link; real‑world congestion could violate this, requiring adaptive measurement.
- Homogeneous block granularity – The study assumes each transformer layer is a block; more fine‑grained partitioning (e.g., sub‑layer sharding) may unlock further gains but complicates the optimization.
- Scalability to massive clusters – While the polynomial algorithm scales to a few dozen nodes, handling hundreds of heterogeneous servers may need additional hierarchical or distributed heuristics.
- Security & privacy – Distributing model blocks across public networks raises concerns about model leakage; future work could explore encrypted inference or secure multi‑party computation in this context.
Overall, this work provides a concrete, mathematically grounded toolkit for anyone looking to serve large language models at scale without breaking the bank.
Authors
- Tingyang Sun
- Ting He
- Bo Ji
- Parimal Parag
Paper Information
- arXiv ID: 2512.21884v1
- Categories: cs.DC, cs.AI, cs.NI
- Published: December 26, 2025