[Paper] Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

Published: May 7, 2026 at 08:25 AM EDT

Source: arXiv - 2605.06113v1

Overview

Large language model (LLM) serving at scale is increasingly constrained by data‑parallel (DP) load imbalance. When a model is split across GPUs/NPUs using tensor or expert parallelism and then replicated across many DP workers, each decode step must wait for the slowest worker, wasting precious compute cycles. The paper introduces BalanceRoute, a family of online routing algorithms that dynamically assign incoming requests to DP workers, dramatically reducing this bottleneck while respecting the sub‑100 ms latency budget of real‑time generation.
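
To make the cost of waiting for the slowest worker concrete, here is a minimal sketch under a simplified model: every DP worker finishes its share of a decode step, then idles until the slowest worker is done. The function name and the per-worker step times are illustrative assumptions, not figures from the paper.

```python
def barrier_utilization(worker_step_ms):
    """Fraction of total worker time spent on useful compute when every
    worker waits for the slowest one before the next decode step starts."""
    step_ms = max(worker_step_ms)   # the step ends when the slowest worker ends
    busy_ms = sum(worker_step_ms)   # useful compute summed over all workers
    return busy_ms / (step_ms * len(worker_step_ms))


# Illustrative per-worker step times in milliseconds (assumed values).
balanced = [40, 40, 40, 40]
skewed = [20, 30, 40, 70]  # same total work, unevenly spread

print(barrier_utilization(balanced))
print(barrier_utilization(skewed))
```

In this toy model the skewed case does the same total work but the step lasts 70 ms instead of 40 ms, so a large share of the fleet's time is spent idle.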

Key Contributions

BalanceRoute‑0 (BR‑0) – A zero‑prediction routing scheme that uses a piecewise‑linear “F‑score” to decide whether admitting a request will keep a worker within a safe load margin or push it into an overload envelope.
BalanceRoute‑H (BR‑H) – Extends BR‑0 with a short, constant look‑ahead horizon (H) and a lightweight termination‑classifier, enabling more informed decisions without heavy forecasting infrastructure.
Two‑stage decomposition – Splits the routing problem into a fast per‑step filter and a finer‑grained assignment, keeping per‑step overhead in the low‑millisecond range.
Real‑world deployment – Implemented on a 144‑NPU cluster and evaluated on both a proprietary production trace and the public Azure‑2024 trace, showing consistent throughput gains over state‑of‑the‑art vLLM baselines.
Open‑source‑ready design – The algorithms rely only on readily available runtime metrics (e.g., current KV‑cache size, per‑worker token count), making them easy to integrate into existing serving stacks.

Methodology

Problem framing – The authors model DP load balancing as an online assignment problem where each incoming request has a sticky assignment (moving its KV cache is expensive) and a load that grows as decoding proceeds.
F‑score formulation – For each worker, the algorithm computes an F‑score that sharply penalizes assignments that would exceed a “safe margin” (the load level where latency stays within the decode budget). The score is piecewise‑linear, allowing fast evaluation.
Two‑stage routing
- Stage 1: A lightweight filter discards workers that would immediately overflow the safe margin.
- Stage 2: Among the remaining candidates, the algorithm picks the worker with the highest F‑score (or, for BR‑H, the highest horizon‑discounted score).
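
The F‑score and the two routing stages above can be sketched as follows. This is not the paper's formula: this summary does not give the exact F‑score, so `f_score`, the `overload_slope` parameter, the `Worker` type, and the fallback used when every worker is filtered out are all assumptions made for illustration. The only properties taken from the text are that the score is piecewise‑linear, integer‑based, penalizes loads past the safe margin sharply, and that the highest score wins.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: int
    load: int  # e.g. tokens currently held in this worker's KV cache


def f_score(load_after, safe_margin, overload_slope=8):
    """Hypothetical piecewise-linear score (higher is better).

    Below the safe margin the score falls one-for-one with load; above it,
    each extra unit of load costs `overload_slope` times as much.
    """
    if load_after <= safe_margin:
        return -load_after
    return -(safe_margin + overload_slope * (load_after - safe_margin))


def route(workers, request_load, safe_margin):
    """Pick a worker for one request and record the sticky assignment."""
    # Stage 1: drop workers that would immediately overflow the safe margin.
    candidates = [w for w in workers if w.load + request_load <= safe_margin]
    if not candidates:
        # Assumed fallback: if nobody has headroom, score every worker.
        candidates = workers
    # Stage 2: choose the candidate with the highest F-score.
    best = max(candidates,
               key=lambda w: f_score(w.load + request_load, safe_margin))
    best.load += request_load  # the request stays on this worker
    return best.worker_id


workers = [Worker(0, 900), Worker(1, 400), Worker(2, 700)]
print(route(workers, request_load=200, safe_margin=1000))
```

Both stages are a single pass over the workers, which matches the linear per‑step complexity mentioned elsewhere in this summary.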
Horizon extension (BR‑H) – By looking ahead a fixed number of future decode steps (H), the router can anticipate load growth and avoid assignments that would become problematic a few steps later. A tiny classifier predicts whether a request will finish within the horizon, feeding into the discounted score.
Implementation details – The routing logic runs inside the request scheduler of the serving system, updating per‑worker load counters after each token generation. All calculations are integer‑based and fit within the sub‑100 ms decode window even when handling hundreds of pending requests.
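
The horizon extension can be illustrated with a similar sketch. Everything here is an assumption made for the example: the load model (each active request grows by one token per decode step), the integer weights that discount later steps, the rule that a request predicted to finish within the horizon is dropped only at the final step, and the stand‑in `f_score`. The classifier itself is represented by a plain boolean flag per request.

```python
def f_score(load, safe_margin, overload_slope=8):
    """Hypothetical piecewise-linear score (higher is better)."""
    if load <= safe_margin:
        return -load
    return -(safe_margin + overload_slope * (load - safe_margin))


def projected_load(requests, step, horizon):
    """Estimated worker load `step` decode steps from now.

    requests: list of (current_tokens, finishes_within_horizon) tuples.
    """
    total = 0
    for tokens, finishes_within_horizon in requests:
        if finishes_within_horizon and step >= horizon:
            continue  # assumed to have left by the end of the horizon
        total += tokens + step  # one new token per decode step
    return total


def horizon_score(requests, safe_margin, horizon):
    """Sum of per-step scores, with integer weights favoring near steps."""
    score = 0
    for step in range(horizon + 1):
        weight = horizon + 1 - step  # assumed linear discount
        score += weight * f_score(projected_load(requests, step, horizon),
                                  safe_margin)
    return score


def route_with_horizon(workers, new_request, safe_margin, horizon):
    """workers: dict mapping a worker name to its list of requests."""
    return max(workers,
               key=lambda name: horizon_score(workers[name] + [new_request],
                                              safe_margin, horizon))


workers = {
    "A": [(90, False)] * 10,  # 900 tokens now, grows by 10 per step
    "B": [(920, False)],      # 920 tokens now, grows by 1 per step
}
print(route_with_horizon(workers, (50, False), safe_margin=1000, horizon=8))
```

In this example worker A is less loaded right now, so a purely instantaneous score would pick it, but its many active requests push it past the safe margin within the horizon, and the look‑ahead score prefers worker B.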

Results & Findings

Metric	vLLM baseline	BalanceRoute‑0	BalanceRoute‑H
Average DP imbalance (std. dev. of per‑worker token count)	1.84 × baseline	0.71 ×	0.58 ×
Throughput (tokens / s)	1.00 × baseline	1.27 ×	1.35 ×
99‑th‑percentile latency	115 ms	98 ms	94 ms
Scheduler overhead (per step)	1.9 ms	0.9 ms	1.1 ms

On the proprietary production trace, BalanceRoute reduced the average DP load variance by 70 % and lifted overall throughput by 27 %.
On the Azure‑2024 public trace, the improvements were even larger (58 % variance reduction, 35 % throughput boost), confirming robustness across different request patterns.
The routing overhead stayed at roughly 1 ms per scheduling step (0.9 ms for BR‑0 and 1.1 ms for BR‑H, versus 1.9 ms for the baseline), preserving the tight decode budget.
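
For readers who want to reproduce the imbalance metric on their own traces, the sketch below computes the standard deviation of per‑worker token counts and expresses one routing policy relative to another. Whether the paper uses the population or sample standard deviation is not stated in this summary; `pstdev` is an assumption, and the token counts are made‑up illustrative values, not results from the paper.

```python
from statistics import pstdev


def dp_imbalance(tokens_per_worker):
    """Standard deviation of per-worker token counts (population form)."""
    return pstdev(tokens_per_worker)


def relative_imbalance(candidate, reference):
    """Imbalance of `candidate` as a multiple of `reference`."""
    return dp_imbalance(candidate) / dp_imbalance(reference)


# Illustrative snapshots of per-worker token counts (assumed values).
reference_counts = [1200, 300, 900, 200]
candidate_counts = [700, 600, 650, 650]

print(dp_imbalance(reference_counts))
print(relative_imbalance(candidate_counts, reference_counts))
```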

Practical Implications

Higher utilization of expensive hardware – By keeping all DP workers busy at similar load levels, cloud providers and enterprises can squeeze more inference throughput out of the same NPU/GPU fleet, lowering cost per generated token.
Reduced tail latency – End‑users experience smoother response times, especially under bursty traffic, because the system no longer stalls on a single overloaded worker.
Plug‑and‑play integration – Since BalanceRoute only needs runtime load counters and a tiny classifier, it can be added to existing serving frameworks (e.g., vLLM, TensorRT‑LLM, DeepSpeed‑Inference) without redesigning the model parallelism layer.
Scalable to larger clusters – The algorithm’s per‑step complexity is linear in the number of workers, making it suitable for clusters with hundreds of devices, a common scenario for LLM APIs.
Foundation for adaptive QoS – The F‑score can be extended to incorporate priority or SLA weights, enabling differentiated service levels (e.g., premium users get routed to less‑loaded workers).

Limitations & Future Work

Sticky KV‑cache assumption – The approach assumes that moving KV caches is prohibitively expensive; future hardware or software innovations (e.g., fast cache migration) could change this trade‑off.
Fixed horizon (H) – BR‑H uses a constant look‑ahead; adaptive horizons based on request length or system load could further improve decisions.
Limited to decode‑only workloads – The paper focuses on token‑by‑token generation; extending the routing logic to mixed encode‑decode or batch‑wise inference remains an open question.
Evaluation on a single hardware family – Experiments were run on a 144‑NPU cluster; validating the algorithms on GPU‑centric clusters or heterogeneous setups would strengthen generality.

Overall, BalanceRoute offers a pragmatic, low‑overhead solution to one of the most pressing scalability challenges in LLM serving, and its concepts are ripe for further exploration in next‑generation inference platforms.

Authors

Tianci Bu
Yuan Lyu
Zixi Chen
Chendong Song
Hong Liang
Tsepten Gurung
Yuwei Fan
Yinyu Ye
Zijie Zhou

Paper Information

arXiv ID: 2605.06113v1
Categories: cs.DC
Published: May 7, 2026
