[Paper] L4: Low-Latency and Load-Balanced LLM Serving via Length-Aware Scheduling
Source: arXiv - 2512.19179v1
Overview
The paper introduces L4, a new runtime system that dramatically speeds up serving large language models (LLMs) on GPUs by being aware of request length. As modern LLMs can handle context windows of 100 k+ tokens, mixing short and long queries in the same GPU batch wastes compute and inflates latency. L4 tackles this by dynamically routing requests to length‑specialized instances, turning a heterogeneous workload into a smooth pipeline.
Key Contributions
- Length‑aware scheduling: A novel scheduler that groups inference instances by the token length of incoming requests, reducing per‑instance heterogeneity.
- Dynamic programming partitioning: An efficient algorithm that determines the optimal number of length‑specialized stages and their boundaries, maximizing an overall quality‑of‑experience (QoE) objective that balances latency and throughput.
- Runtime range refinement & decentralized rebalancing: Continuous adjustment of length boundaries and load distribution both across groups and within each group, without a central bottleneck.
- Comprehensive evaluation: Demonstrates up to 67 % lower end‑to‑end latency, 69 % lower tail latency, and 2.89× higher throughput versus the best existing multi‑instance schedulers.
Methodology
- Instance Grouping – L4 launches several identical GPU instances of the same LLM. Each instance is assigned a length range (e.g., 0‑2 k tokens, 2‑8 k tokens, …).
- Dynamic Partitioning – Using a lightweight dynamic‑programming (DP) solver, L4 periodically recomputes the optimal set of length ranges based on the current request distribution and a QoE objective that balances latency and throughput (a partitioning sketch follows this list).
- Pipeline Flow – When a request arrives, it is placed in the group whose range covers its token count. If the request grows (e.g., through follow‑up turns), it can be promoted to a later stage, forming a natural pipeline across groups (see the routing sketch after this list).
- Load Balancing – Within each group, a decentralized balancer monitors GPU utilization and migrates requests between instances to avoid hotspots. Across groups, L4 refines the length boundaries to keep each group equally busy (a rebalancing sketch also follows this list).
- Implementation – Built on top of a standard inference engine (e.g., vLLM), L4 adds only a thin scheduling layer, keeping the core model execution untouched.
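The sketch below illustrates the length‑range routing and promotion idea; it is a minimal illustration, not the authors' implementation. The `LengthRouter` class, the group boundaries, and the request IDs are assumptions made for the example; only the idea of routing by token count and promoting a request that outgrows its range comes from the paper.

```python
# Minimal sketch (not the paper's code): route requests to length-specialized
# groups by token count, and promote a request that outgrows its range.
import bisect

class LengthRouter:
    def __init__(self, boundaries):
        # Exclusive upper bounds of each group, e.g. [2_000, 8_000, 32_000]
        # defines groups [0, 2k), [2k, 8k), [8k, 32k), and [32k, +inf).
        self.boundaries = sorted(boundaries)
        self.groups = [[] for _ in range(len(self.boundaries) + 1)]

    def route(self, req_id, token_count):
        """Place a request in the group whose range covers its token count."""
        idx = bisect.bisect_right(self.boundaries, token_count)
        self.groups[idx].append(req_id)
        return idx

    def maybe_promote(self, req_id, new_token_count, current_idx):
        """Promote a growing request (e.g. a follow-up turn) to a later group."""
        target_idx = bisect.bisect_right(self.boundaries, new_token_count)
        if target_idx > current_idx:
            self.groups[current_idx].remove(req_id)
            self.groups[target_idx].append(req_id)
            return target_idx
        return current_idx

router = LengthRouter([2_000, 8_000, 32_000])
g = router.route("req-1", 1_500)             # lands in the short-context group
g = router.maybe_promote("req-1", 9_000, g)  # follow-up grew the context
```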
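The next sketch shows one way the partitioning step could look: a generic interval‑partitioning DP over a request‑length histogram. The cost function (group load times length spread, as a rough proxy for batching heterogeneity) is an illustrative assumption; the paper's actual QoE objective combining latency and throughput is not reproduced here.

```python
# A minimal DP sketch for splitting a length histogram into contiguous groups.
# The cost function below is an assumption, not the paper's QoE objective.

def partition_ranges(hist, num_groups):
    """hist: list of (token_length, request_count) sorted by length.
    Returns exclusive upper-bound boundaries for num_groups groups."""
    n = len(hist)
    assert 1 <= num_groups <= n

    def cost(i, j):
        # Heterogeneity proxy for buckets hist[i..j]: load * length spread.
        load = sum(c for _, c in hist[i:j + 1])
        spread = hist[j][0] - hist[i][0]
        return load * spread

    INF = float("inf")
    # dp[k][j]: best cost of splitting buckets 0..j into k+1 groups.
    dp = [[INF] * n for _ in range(num_groups)]
    cut = [[-1] * n for _ in range(num_groups)]
    for j in range(n):
        dp[0][j] = cost(0, j)
    for k in range(1, num_groups):
        for j in range(k, n):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + cost(i + 1, j)
                if c < dp[k][j]:
                    dp[k][j], cut[k][j] = c, i

    # Recover the boundaries by walking the cut points backwards.
    bounds, j = [], n - 1
    for k in range(num_groups - 1, 0, -1):
        j = cut[k][j]
        bounds.append(hist[j][0] + 1)  # exclusive upper bound of that group
    return sorted(bounds)

# Example: a bimodal workload of short chats and long documents.
hist = [(1_000, 500), (2_000, 300), (4_000, 100), (64_000, 40), (100_000, 20)]
print(partition_ranges(hist, num_groups=2))  # one boundary separating the modes
```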
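Finally, a toy view of within‑group rebalancing, assuming each instance exposes a utilization figure and a queue of not‑yet‑started requests; L4's decentralized protocol and migration mechanics are more involved than this simple threshold rule.

```python
# Toy within-group rebalancer (an assumption for illustration): move one queued
# request from the busiest instance to the least busy one when the gap is large.

def rebalance_group(instances, gap_threshold=0.2):
    """instances: list of dicts like {"util": 0.9, "queue": [req_id, ...]}."""
    busiest = max(instances, key=lambda inst: inst["util"])
    idlest = min(instances, key=lambda inst: inst["util"])
    if busiest["util"] - idlest["util"] > gap_threshold and busiest["queue"]:
        req = busiest["queue"].pop()   # migrate a request that has not started
        idlest["queue"].append(req)
        return req
    return None

group = [{"util": 0.95, "queue": ["r7", "r9"]}, {"util": 0.40, "queue": []}]
moved = rebalance_group(group)  # "r9" migrates to the under-utilized instance
```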
Results & Findings
| Metric | Baseline (state‑of‑the‑art) | L4 |
|---|---|---|
| Avg. end‑to‑end latency | 1.2 s | 0.4 s (‑67 %) |
| 99th‑percentile latency | 2.8 s | 0.9 s (‑69 %) |
| Throughput (queries/s) | 120 | 350 (×2.89) |
| GPU utilization | ~55 % | ~92 % |
- Length heterogeneity is the dominant bottleneck when context windows exceed ~32 k tokens.
- Dynamic partitioning adapts to workload shifts (e.g., sudden influx of long documents) within a few seconds, keeping QoE stable.
- Decentralized rebalancing eliminates a single point of failure and scales to dozens of GPU instances without noticeable overhead.
Practical Implications
- Lower operational cost – Higher GPU utilization means fewer GPUs are needed to meet the same SLA, directly cutting cloud spend.
- Better user experience – Faster response times, especially for long‑context queries (code generation, document summarization), improve perceived quality.
- Simplified capacity planning – Operators can rely on L4’s self‑tuning to handle mixed workloads without manually sizing separate “short‑query” and “long‑query” clusters.
- Plug‑and‑play integration – Since L4 sits on top of existing inference servers, teams can adopt it with minimal code changes, preserving existing model pipelines and monitoring tools.
- Enables new services – Applications that were previously avoided due to latency (e.g., real‑time legal document analysis, multi‑turn chat with 100 k token context) become feasible.
Limitations & Future Work
- Assumes static model weights – L4 does not yet handle on‑the‑fly model updates (e.g., LoRA fine‑tuning) that could change per‑token compute cost.
- GPU‑only focus – The current design targets homogeneous GPU clusters; extending to heterogeneous accelerators (TPUs, CPUs) is left for later.
- Scheduling overhead – While lightweight, the DP solver and range refinement add a small constant latency; future work could explore fully asynchronous or learning‑based schedulers.
- Security & isolation – Multi‑tenant scenarios may need additional sandboxing to prevent cross‑request interference, which L4 does not address out of the box.
Bottom line: L4 shows that a modest, length‑aware scheduling layer can unlock dramatic performance gains for LLM serving, making large‑context applications more affordable and responsive for developers and enterprises alike.
Authors
- Yitao Yuan
- Chenqi Zhao
- Bohan Zhao
- Zane Cao
- Yongchao He
- Wenfei Wu
Paper Information
- arXiv ID: 2512.19179v1
- Categories: cs.DC
- Published: December 22, 2025