[Paper] L4: Low-Latency and Load-Balanced LLM Serving via Length-Aware Scheduling
Source: arXiv - 2512.19179v1
Overview
The paper introduces L4, a new runtime system that dramatically speeds up serving large language models (LLMs) on GPUs by being aware of request length. As modern LLMs can handle context windows of 100 k+ tokens, mixing short and long queries in the same GPU batch wastes compute and inflates latency. L4 tackles this by dynamically routing requests to length‑specialized instances, turning a heterogeneous workload into a smooth pipeline.
Key Contributions
- Length‑aware scheduling: A novel scheduler that groups inference instances by the token length of incoming requests, reducing per‑instance heterogeneity.
- Dynamic programming partitioning: An efficient algorithm that determines the optimal number of length‑specialized stages and their boundaries, maximizing an overall quality‑of‑experience (QoE) objective that balances latency and throughput.
- Runtime range refinement & decentralized rebalancing: Continuous adjustment of length boundaries and load distribution both across groups and within each group, without a central bottleneck.
- Comprehensive evaluation: Demonstrates up to 67 % lower end‑to‑end latency, 69 % lower tail latency, and 2.89× higher throughput versus the best existing multi‑instance schedulers.
Methodology
- Instance Grouping – L4 launches several identical GPU instances of the same LLM. Each instance is assigned a length range (e.g., 0‑2 k tokens, 2‑8 k tokens, …).
- Dynamic Partitioning – Using a lightweight dynamic‑programming (DP) solver, L4 periodically recomputes the optimal set of length ranges based on the current request distribution and a QoE objective that balances latency and throughput (a partitioning sketch follows this list).
- Pipeline Flow – When a request arrives, it is placed in the group whose range covers its token count. If the request grows (e.g., through follow‑up turns), it can be promoted to a later stage, forming a natural pipeline across groups (see the routing sketch after this list).
- Load Balancing – Within each group, a decentralized balancer monitors GPU utilization and migrates requests between instances to avoid hotspots. Across groups, L4 refines the length boundaries to keep each group equally busy (a rebalancing sketch also follows this list).
- Implementation – Built on top of a standard inference engine (e.g., vLLM), L4 adds only a thin scheduling layer, keeping the core model execution untouched.
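The sketch below illustrates the length‑range routing and promotion idea; it is a minimal illustration, not the authors' implementation. The `LengthRouter` class, the group boundaries, and the request IDs are assumptions made for the example; only the idea of routing by token count and promoting a request that outgrows its range comes from the paper.

```python
# Minimal sketch (not the paper's code): route requests to length-specialized
# groups by token count, and promote a request that outgrows its range.
import bisect

class LengthRouter:
    def __init__(self, boundaries):
        # Exclusive upper bounds of each group, e.g. [2_000, 8_000, 32_000]
        # defines groups [0, 2k), [2k, 8k), [8k, 32k), and [32k, +inf).
        self.boundaries = sorted(boundaries)
        self.groups = [[] for _ in range(len(self.boundaries) + 1)]

    def route(self, req_id, token_count):
        """Place a request in the group whose range covers its token count."""
        idx = bisect.bisect_right(self.boundaries, token_count)
        self.groups[idx].append(req_id)
        return idx

    def maybe_promote(self, req_id, new_token_count, current_idx):
        """Promote a growing request (e.g. a follow-up turn) to a later group."""
        target_idx = bisect.bisect_right(self.boundaries, new_token_count)
        if target_idx > current_idx:
            self.groups[current_idx].remove(req_id)
            self.groups[target_idx].append(req_id)
            return target_idx
        return current_idx

router = LengthRouter([2_000, 8_000, 32_000])
g = router.route("req-1", 1_500)             # lands in the short-context group
g = router.maybe_promote("req-1", 9_000, g)  # follow-up grew the context
```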
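The next sketch shows one way the partitioning step could look: a generic interval‑partitioning DP over a request‑length histogram. The cost function (group load times length spread, as a rough proxy for batching heterogeneity) is an illustrative assumption; the paper's actual QoE objective combining latency and throughput is not reproduced here.

```python
# A minimal DP sketch for splitting a length histogram into contiguous groups.
# The cost function below is an assumption, not the paper's QoE objective.

def partition_ranges(hist, num_groups):
    """hist: list of (token_length, request_count) sorted by length.
    Returns exclusive upper-bound boundaries for num_groups groups."""
    n = len(hist)
    assert 1 <= num_groups <= n

    def cost(i, j):
        # Heterogeneity proxy for buckets hist[i..j]: load * length spread.
        load = sum(c for _, c in hist[i:j + 1])
        spread = hist[j][0] - hist[i][0]
        return load * spread

    INF = float("inf")
    # dp[k][j]: best cost of splitting buckets 0..j into k+1 groups.
    dp = [[INF] * n for _ in range(num_groups)]
    cut = [[-1] * n for _ in range(num_groups)]
    for j in range(n):
        dp[0][j] = cost(0, j)
    for k in range(1, num_groups):
        for j in range(k, n):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + cost(i + 1, j)
                if c < dp[k][j]:
                    dp[k][j], cut[k][j] = c, i

    # Recover the boundaries by walking the cut points backwards.
    bounds, j = [], n - 1
    for k in range(num_groups - 1, 0, -1):
        j = cut[k][j]
        bounds.append(hist[j][0] + 1)  # exclusive upper bound of that group
    return sorted(bounds)

# Example: a bimodal workload of short chats and long documents.
hist = [(1_000, 500), (2_000, 300), (4_000, 100), (64_000, 40), (100_000, 20)]
print(partition_ranges(hist, num_groups=2))  # one boundary separating the modes
```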
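Finally, a toy view of within‑group rebalancing, assuming each instance exposes a utilization figure and a queue of not‑yet‑started requests; L4's decentralized protocol and migration mechanics are more involved than this simple threshold rule.

```python
# Toy within-group rebalancer (an assumption for illustration): move one queued
# request from the busiest instance to the least busy one when the gap is large.

def rebalance_group(instances, gap_threshold=0.2):
    """instances: list of dicts like {"util": 0.9, "queue": [req_id, ...]}."""
    busiest = max(instances, key=lambda inst: inst["util"])
    idlest = min(instances, key=lambda inst: inst["util"])
    if busiest["util"] - idlest["util"] > gap_threshold and busiest["queue"]:
        req = busiest["queue"].pop()   # migrate a request that has not started
        idlest["queue"].append(req)
        return req
    return None

group = [{"util": 0.95, "queue": ["r7", "r9"]}, {"util": 0.40, "queue": []}]
moved = rebalance_group(group)  # "r9" migrates to the under-utilized instance
```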
Results & Findings
| Metric | Baseline (state‑of‑the‑art) | L4 |
|---|---|---|
| Avg. end‑to‑end latency | 1.2 s | 0.4 s (‑67 %) |
| 99th‑percentile latency | 2.8 s | 0.9 s (‑69 %) |
| Throughput (queries/s) | 120 | 350 (×2.89) |
| GPU utilization | ~55 % | ~92 % |
- Length heterogeneity is the dominant bottleneck when context windows exceed ~32 k tokens.
- Dynamic partitioning adapts to workload shifts (e.g., sudden influx of long documents) within a few seconds, keeping QoE stable.
- Decentralized rebalancing eliminates a single point of failure and scales to dozens of GPU instances without noticeable overhead.
Practical Implications
- Lower operational cost – Higher GPU utilization means fewer GPUs are needed to meet the same SLA, directly cutting cloud spend.
- Better user experience – Faster response times, especially for long‑context queries (code generation, document summarization), improve perceived quality.
- Simplified capacity planning – Operators can rely on L4’s self‑tuning to handle mixed workloads without manually sizing separate “short‑query” and “long‑query” clusters.
- Plug‑and‑play integration – Since L4 sits on top of existing inference servers, teams can adopt it with minimal code changes, preserving existing model pipelines and monitoring tools.
- Enables new services – Applications that were previously avoided due to latency (e.g., real‑time legal document analysis, multi‑turn chat with 100 k token context) become feasible.
Limitations & Future Work
- Assumes static model weights – L4 does not yet handle on‑the‑fly model updates (e.g., LoRA fine‑tuning) that could change per‑token compute cost.
- GPU‑only focus – The current design targets homogeneous GPU clusters; extending to heterogeneous accelerators (TPUs, CPUs) is left for later.
- Scheduling overhead – While lightweight, the DP solver and range refinement add a small constant latency; future work could explore fully asynchronous or learning‑based schedulers.
- Security & isolation – Multi‑tenant scenarios may need additional sandboxing to prevent cross‑request interference, which L4 does not address out of the box.
Bottom line: L4 shows that a modest, length‑aware scheduling layer can unlock dramatic performance gains for LLM serving, making large‑context applications more affordable and responsive for developers and enterprises alike.
Authors
- Yitao Yuan
- Chenqi Zhao
- Bohan Zhao
- Zane Cao
- Yongchao He
- Wenfei Wu
Paper Information
- arXiv ID: 2512.19179v1
- Categories: cs.DC
- Published: December 22, 2025