[Paper] Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
Source: arXiv - 2605.06113v1
Overview
Large language model (LLM) serving at scale is increasingly constrained by data‑parallel (DP) load imbalance. When a model is split across GPUs/NPUs using tensor or expert parallelism and then replicated across many DP workers, each decode step must wait for the slowest worker, so the faster workers sit idle for the rest of the step. The paper introduces BalanceRoute, a family of online routing algorithms that assign incoming requests to DP workers so as to reduce this bottleneck while respecting the sub‑100 ms latency budget of real‑time generation.
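To make the straggler effect concrete, here is a minimal sketch. It assumes (this is an illustration, not the paper's cost model) that a worker's step time is proportional to its token load and that all workers synchronize at the end of each decode step. The function name `step_stats` and the load values are hypothetical.

```python
def step_stats(worker_loads):
    """Return (step_time, idle_fraction) for one synchronized decode step.

    Assumption: step time is proportional to load, and every worker
    waits at a barrier until the slowest worker has finished.
    """
    step_time = max(worker_loads)              # the slowest worker sets the pace
    useful = sum(worker_loads)                 # work actually performed
    capacity = step_time * len(worker_loads)   # work that could have been performed
    idle_fraction = 1.0 - useful / capacity
    return step_time, idle_fraction


balanced = [100, 100, 100, 100]
skewed = [40, 60, 100, 200]   # same total load, unevenly spread

print(step_stats(balanced))  # (100, 0.0)
print(step_stats(skewed))    # (200, 0.5)
```

Both configurations carry the same total load, yet the skewed one takes twice as long per step and leaves half of the aggregate capacity idle.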
Key Contributions
- BalanceRoute‑0 (BR‑0) – A zero‑prediction routing scheme that uses a piecewise‑linear “F‑score” to decide whether admitting a request will keep a worker within a safe load margin or push it into an overload envelope.
- BalanceRoute‑H (BR‑H) – Extends BR‑0 with a short, constant look‑ahead horizon (H) and a lightweight termination‑classifier, enabling more informed decisions without heavy forecasting infrastructure.
- Two‑stage decomposition – Splits the routing problem into a fast per‑step filter and a finer‑grained assignment, keeping per‑step overhead in the low‑millisecond range.
- Real‑world deployment – Implemented on a 144‑NPU cluster and evaluated on both a proprietary production trace and the public Azure‑2024 trace, showing consistent throughput gains over state‑of‑the‑art vLLM baselines.
- Open‑source‑ready design – The algorithms rely only on readily available runtime metrics (e.g., current KV‑cache size, per‑worker token count), making them easy to integrate into existing serving stacks.
Methodology
- Problem framing – The authors model DP load balancing as an online assignment problem where each incoming request has a sticky assignment (moving its KV cache is expensive) and a load that grows as decoding proceeds.
- F‑score formulation – For each worker, the algorithm computes an F‑score that sharply penalizes assignments that would exceed a “safe margin” (the load level where latency stays within the decode budget). The score is piecewise‑linear, allowing fast evaluation.
- Two‑stage routing
- Stage 1: A lightweight filter discards workers that would immediately overflow the safe margin.
- Stage 2: Among the remaining candidates, the algorithm picks the worker with the highest F‑score (or, for BR‑H, the highest horizon‑discounted score).
- Horizon extension (BR‑H) – By looking ahead a fixed number of future decode steps (H), the router can anticipate load growth and avoid assignments that would become problematic a few steps later. A tiny classifier predicts whether a request will finish within the horizon, feeding into the discounted score.
- Implementation details – The routing logic runs inside the request scheduler of the serving system, updating per‑worker load counters after each token generation. All calculations are integer‑based and fit within the sub‑100 ms decode window even when handling hundreds of pending requests.
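The routing steps above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the summary does not give the exact F‑score formula, so the piecewise‑linear shape (headroom below the safe margin, a steeper penalty above it), the `overload_slope` and `gamma` parameters, the fallback when every worker is filtered out, and all names (`Worker`, `load_at`, `f_score`, `route`) are assumptions. The float discount `gamma` is used for readability even though the paper describes integer arithmetic.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Each request is (tokens_held, predicted_to_finish_within_horizon).
    requests: list = field(default_factory=list)


def load_at(requests, h, horizon):
    """Projected token load h steps ahead; each live request grows 1 token per step."""
    total = 0
    for tokens, finishes in requests:
        if finishes and 0 < horizon <= h:
            continue  # assumed finished and evicted by the end of the horizon
        total += tokens + h
    return total


def f_score(load, safe_margin, overload_slope=4):
    """Piecewise-linear score: headroom below the margin, steep penalty above it."""
    if load <= safe_margin:
        return safe_margin - load
    return -overload_slope * (load - safe_margin)


def route(workers, new_request, safe_margin, horizon=0, gamma=0.9):
    """Assign new_request to a worker. horizon=0 mimics BR-0, horizon>0 mimics BR-H."""
    # Stage 1: drop workers that the request would push past the safe margin now.
    candidates = [w for w in workers
                  if load_at(w.requests + [new_request], 0, horizon) <= safe_margin]
    if not candidates:
        candidates = workers  # assumption: pick the least-bad worker rather than reject

    # Stage 2: the highest (horizon-discounted) F-score wins.
    def score(w):
        reqs = w.requests + [new_request]
        return sum(gamma ** h * f_score(load_at(reqs, h, horizon), safe_margin)
                   for h in range(horizon + 1))

    best = max(candidates, key=score)
    best.requests.append(new_request)  # sticky: the request stays on this worker
    return best


def make_workers():
    return [Worker("w0", [(300, False), (200, False)]),
            Worker("w1", [(450, True), (150, False)])]


print(route(make_workers(), (100, False), safe_margin=1000).name)             # w0
print(route(make_workers(), (100, False), safe_margin=1000, horizon=2).name)  # w1
```

With no look‑ahead, the request goes to `w0`, which is less loaded right now. With a horizon of 2, `w1` wins because its largest request is predicted to finish inside the horizon and free up its share of the load.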
Results & Findings
| Metric | vLLM baseline | BalanceRoute‑0 | BalanceRoute‑H |
|---|---|---|---|
| Average DP imbalance (std. dev. of per‑worker token count) | 1.84 × baseline | 0.71 × | 0.58 × |
| Throughput (tokens / s) | 1.00 × baseline | 1.27 × | 1.35 × |
| 99‑th‑percentile latency | 115 ms | 98 ms | 94 ms |
| Scheduler overhead (per step) | 1.9 ms | 0.9 ms | 1.1 ms |
- On the proprietary production trace, BalanceRoute reduced the average DP load variance by 70 % and lifted overall throughput by 27 %.
- On the Azure‑2024 public trace, BalanceRoute achieved a 58 % variance reduction and a 35 % throughput boost, indicating that the gains carry over to a different request pattern.
- The routing overhead stayed at roughly 1 ms per scheduling round (0.9 ms for BR‑0 and 1.1 ms for BR‑H in the table above), preserving the tight decode budget.
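The imbalance metric in the table (standard deviation of per‑worker token count, reported relative to a baseline) can be computed along these lines. This is a minimal sketch: whether the paper uses the population or the sample standard deviation is not stated here, so the use of `statistics.pstdev` is an assumption, and the token counts are made‑up illustration values rather than results from the paper.

```python
from statistics import pstdev


def dp_imbalance(per_worker_tokens):
    """Population standard deviation of per-worker token counts (assumed definition)."""
    return pstdev(per_worker_tokens)


def relative_imbalance(candidate_tokens, baseline_tokens):
    """Imbalance of a routing policy expressed as a multiple of the baseline's."""
    return dp_imbalance(candidate_tokens) / dp_imbalance(baseline_tokens)


# Hypothetical snapshots of per-worker token counts under two routing policies.
baseline = [400, 800, 1200, 1600]
routed = [900, 1000, 1000, 1100]

print(round(relative_imbalance(routed, baseline), 3))  # 0.158
```

A value below 1.0 means the candidate policy spreads tokens more evenly than the baseline on that snapshot.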
Practical Implications
- Higher utilization of expensive hardware – By keeping all DP workers busy at similar load levels, cloud providers and enterprises can squeeze more inference throughput out of the same NPU/GPU fleet, lowering cost per generated token.
- Reduced tail latency – End‑users experience smoother response times, especially under bursty traffic, because the system no longer stalls on a single overloaded worker.
- Plug‑and‑play integration – Since BalanceRoute only needs runtime load counters and a tiny classifier, it can be added to existing serving frameworks (e.g., vLLM, TensorRT‑LLM, DeepSpeed‑Inference) without redesigning the model parallelism layer.
- Scalable to larger clusters – The algorithm’s per‑step complexity is linear in the number of workers, making it suitable for clusters with hundreds of devices, a common scenario for LLM APIs.
- Foundation for adaptive QoS – The F‑score can be extended to incorporate priority or SLA weights, enabling differentiated service levels (e.g., premium users get routed to less‑loaded workers).
Limitations & Future Work
- Sticky KV‑cache assumption – The approach assumes that moving KV caches is prohibitively expensive; future hardware or software innovations (e.g., fast cache migration) could change this trade‑off.
- Fixed horizon (H) – BR‑H uses a constant look‑ahead; adaptive horizons based on request length or system load could further improve decisions.
- Limited to decode‑only workloads – The paper focuses on token‑by‑token generation; extending the routing logic to mixed encode‑decode or batch‑wise inference remains an open question.
- Evaluation on a single hardware family – Experiments were run on a 144‑NPU cluster; validating the algorithms on GPU‑centric clusters or heterogeneous setups would strengthen generality.
Overall, BalanceRoute offers a pragmatic, low‑overhead approach to DP load imbalance in LLM serving, and the limitations above point to natural extensions for next‑generation inference platforms.
Authors
- Tianci Bu
- Yuan Lyu
- Zixi Chen
- Chendong Song
- Hong Liang
- Tsepten Gurung
- Yuwei Fan
- Yinyu Ye
- Zijie Zhou
Paper Information
- arXiv ID: 2605.06113v1
- Categories: cs.DC
- Published: May 7, 2026