[Paper] L4: Low-Latency and Load-Balanced LLM Serving via Length-Aware Scheduling

Published: December 22, 2025 at 04:13 AM EST
4 min read
Source: arXiv - 2512.19179v1

Overview

The paper introduces L4, a runtime system that speeds up serving large language models (LLMs) on GPUs by scheduling requests according to their length. Because modern LLMs handle context windows of 100 k+ tokens, mixing short and long queries in the same GPU batch wastes compute and inflates latency. L4 tackles this by dynamically routing requests to length‑specialized instances, turning a heterogeneous workload into a smooth pipeline.
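
To make the routing idea concrete, here is a minimal Python sketch that sends each request to the instance group whose token-length range covers it. The boundaries, group names, and the bisect-based lookup are illustrative assumptions, not L4's actual implementation.

```python
from bisect import bisect_right

# Hypothetical token-length boundaries separating four length-specialized groups.
BOUNDARIES = [2_000, 8_000, 32_000]
GROUPS = ["short", "medium", "long", "xl"]  # one instance pool per length range

def route(request_tokens: int) -> str:
    """Return the group whose length range covers the request."""
    return GROUPS[bisect_right(BOUNDARIES, request_tokens)]

print(route(1_500))    # -> short
print(route(120_000))  # -> xl
```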

Key Contributions

  • Length‑aware scheduling: A novel scheduler that groups inference instances by the token length of incoming requests, reducing per‑instance heterogeneity.
  • Dynamic programming partitioning: An efficient algorithm that determines the optimal number of length‑specialized stages to maximize overall QoE, balancing latency and throughput.
  • Runtime range refinement & decentralized rebalancing: Continuous adjustment of length boundaries and load distribution both across groups and within each group, without a central bottleneck.
  • Comprehensive evaluation: Demonstrates up to 67 % lower end‑to‑end latency, 69 % lower tail latency, and 2.89× higher throughput versus the best existing multi‑instance schedulers.

Methodology

  1. Instance Grouping – L4 launches several GPU instances serving the same LLM. Each instance is assigned a length range (e.g., 0‑2 k tokens, 2‑8 k tokens, …).
  2. Dynamic Partitioning – Using a lightweight dynamic‑programming (DP) solver, L4 periodically recomputes the optimal set of length ranges based on the current request distribution and a QoE objective that balances latency and throughput (see the DP sketch after this list).
  3. Pipeline Flow – When a request arrives, it is placed in the group whose range best matches its token count. If the request grows (e.g., due to a follow‑up), it can be promoted to a later stage, forming a natural pipeline across groups.
  4. Load Balancing – Within each group, a decentralized balancer monitors GPU utilization and migrates requests between instances to avoid hotspots; across groups, L4 refines the length boundaries to keep each group equally busy (see the rebalancing sketch below).
  5. Implementation – Built on top of a standard inference engine (e.g., vLLM), L4 adds only a thin scheduling layer, keeping the core model execution untouched.
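
To illustrate how such a partitioning step can be computed, below is a minimal sketch of a contiguous-partition dynamic program over a request-length histogram. It minimizes the heaviest group's load as a stand-in objective; the paper's actual QoE model, per-length cost estimates, and boundary-recovery logic are not reproduced here, and the function and variable names are hypothetical.

```python
from functools import lru_cache

def partition_lengths(bucket_load, num_groups):
    """Split length buckets into `num_groups` contiguous ranges, minimizing
    the heaviest group's load (a simple proxy for balanced instances)."""
    n = len(bucket_load)
    prefix = [0]
    for load in bucket_load:
        prefix.append(prefix[-1] + load)

    @lru_cache(maxsize=None)
    def dp(i, k):
        # Best achievable max-group-load when buckets [0, i) are split into k groups.
        if k == 1:
            return prefix[i]
        best = float("inf")
        for j in range(k - 1, i):  # the last group covers buckets [j, i)
            best = min(best, max(dp(j, k - 1), prefix[i] - prefix[j]))
        return best

    return dp(n, num_groups)

# Example: per-bucket request load (e.g., a token-count histogram weighted by cost)
print(partition_lengths((5, 9, 2, 7, 3, 8), 3))  # -> 14
```

A production version would also return the chosen boundaries and fold latency/throughput terms into the cost, but the recurrence structure stays the same.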
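
The within-group rebalancing step can be pictured with the sketch below: instances compare utilization and migrate queued work from the busiest to the least busy peer. This is a simplified, group-local illustration with made-up thresholds and helpers, not the paper's decentralized protocol.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    utilization: float            # e.g., fraction of GPU busy time
    queue: list = field(default_factory=list)

def rebalance(instances, threshold=0.15):
    """Move a queued request from the busiest to the least busy instance
    whenever their utilization gap exceeds `threshold`."""
    busiest = max(instances, key=lambda x: x.utilization)
    idlest = min(instances, key=lambda x: x.utilization)
    if busiest.utilization - idlest.utilization > threshold and busiest.queue:
        request = busiest.queue.pop()       # migrate one queued request
        idlest.queue.append(request)
        return f"moved {request} from {busiest.name} to {idlest.name}"
    return "balanced"

group = [Instance("gpu-0", 0.95, ["req-7"]), Instance("gpu-1", 0.55)]
print(rebalance(group))  # -> moved req-7 from gpu-0 to gpu-1
```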

Results & Findings

| Metric | Baseline (state-of-the-art) | L4 |
| --- | --- | --- |
| Avg. end-to-end latency | 1.2 s | 0.4 s (−67 %) |
| 99th-percentile latency | 2.8 s | 0.9 s (−69 %) |
| Throughput (queries/s) | 120 | 350 (2.89×) |
| GPU utilization | ~55 % | ~92 % |

  • Length heterogeneity is the dominant bottleneck when context windows exceed ~32 k tokens.
  • Dynamic partitioning adapts to workload shifts (e.g., sudden influx of long documents) within a few seconds, keeping QoE stable.
  • Decentralized rebalancing eliminates a single point of failure and scales to dozens of GPU instances without noticeable overhead.

Practical Implications

  • Lower operational cost – Higher GPU utilization means fewer GPUs are needed to meet the same SLA, directly cutting cloud spend.
  • Better user experience – Faster response times, especially for long‑context queries (code generation, document summarization), improve perceived quality.
  • Simplified capacity planning – Operators can rely on L4’s self‑tuning to handle mixed workloads without manually sizing separate “short‑query” and “long‑query” clusters.
  • Plug‑and‑play integration – Since L4 sits on top of existing inference servers, teams can adopt it with minimal code changes, preserving existing model pipelines and monitoring tools.
  • Enables new services – Applications that were previously avoided due to latency (e.g., real‑time legal document analysis, multi‑turn chat with 100 k token context) become feasible.

Limitations & Future Work

  • Assumes static model weights – L4 does not yet handle on‑the‑fly model updates (e.g., LoRA fine‑tuning) that could change per‑token compute cost.
  • GPU‑only focus – The current design targets homogeneous GPU clusters; extending to heterogeneous accelerators (TPUs, CPUs) is left for later.
  • Scheduling overhead – While lightweight, the DP solver and range refinement add a small constant latency; future work could explore fully asynchronous or learning‑based schedulers.
  • Security & isolation – Multi‑tenant scenarios may need additional sandboxing to prevent cross‑request interference, which L4 does not address out of the box.

Bottom line: L4 shows that a modest, length‑aware scheduling layer can unlock dramatic performance gains for LLM serving, making large‑context applications more affordable and responsive for developers and enterprises alike.

Authors

  • Yitao Yuan
  • Chenqi Zhao
  • Bohan Zhao
  • Zane Cao
  • Yongchao He
  • Wenfei Wu

Paper Information

  • arXiv ID: 2512.19179v1
  • Categories: cs.DC
  • Published: December 22, 2025
  • PDF: https://arxiv.org/pdf/2512.19179v1