[Paper] FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
Source: arXiv - 2602.16603v1
Overview
Large language models (LLMs) are being served at massive scale, and the “prefill” phase—where the model processes the initial prompt—often becomes a bottleneck. Long‑running prompts can block shorter, higher‑priority requests, causing time‑to‑first‑token (TTFT) violations. The paper FlowPrefill proposes a new serving architecture that separates when a request can be preempted from how finely the work is chunked, dramatically reducing head‑of‑line (HoL) blocking while keeping throughput high.
Key Contributions
- Operator‑Level Preemption: Uses natural boundaries between model operators (e.g., attention, feed‑forward) to pause and resume a request without resorting to tiny, inefficient chunks.
- Event‑Driven Scheduling: Scheduling decisions are made only on request arrivals or completions, eliminating the constant polling overhead of traditional schedulers.
- TTFT‑Goodput Optimizer: A lightweight runtime that dynamically balances latency (TTFT) against overall goodput, respecting heterogeneous SLOs across requests.
- Real‑World Evaluation: Experiments on production‑grade traces show up to 5.6× improvement in maximum goodput compared with the best existing LLM serving systems, while meeting diverse latency targets.
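The operator‑level preemption idea above can be sketched in a few lines. Everything here (the request structure, the operator loop, the preemption check) is an illustrative assumption for exposition, not the paper's actual implementation:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    # Hypothetical request record; ordering by TTFT deadline so the
    # heap pops the tightest-SLO request first.
    deadline: float
    name: str = field(compare=False)
    ops_done: int = field(default=0, compare=False)
    num_ops: int = field(default=4, compare=False)

def run_prefill(req: Request, ready: list, arrivals: list) -> bool:
    """Run operators until done, or yield if a tighter-SLO request arrives.

    The preemption check happens only BETWEEN operators -- never
    mid-kernel -- so no compute inside an operator is wasted.
    """
    while req.ops_done < req.num_ops:
        req.ops_done += 1                      # execute one operator (stub)
        if arrivals and arrivals[0].deadline < req.deadline:
            heapq.heappush(ready, req)         # checkpoint at operator boundary
            return False                       # preempted, state preserved
    return True                                # prefill finished

ready: list = []
heapq.heappush(ready, Request(deadline=5.0, name="long-batch"))
arrivals = [Request(deadline=0.5, name="chat")]   # interactive request arrives

req = heapq.heappop(ready)
done = run_prefill(req, ready, arrivals)
print(done, req.ops_done)   # → False 1 (long-batch yields after one operator)
```

Because the checkpointed request keeps its partial progress (`ops_done`), resuming it later costs nothing beyond the scheduler push/pop, which is the key difference from token‑level chunking.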
Methodology
- Decoupling Granularity from Frequency – Instead of fixing a chunk size (e.g., “process 10 tokens then check for preemption”), FlowPrefill lets the scheduler decide when to intervene (event‑driven) and where to intervene (operator boundaries).
- Operator‑Level Hooks – The authors instrument the transformer implementation so that each operator can be safely checkpointed and resumed. This yields fine‑grained preemption without the compute waste of small token‑level chunks.
- Scheduler Logic – A central controller maintains a priority queue of pending requests. When a new request arrives or an existing one finishes an operator, the scheduler re‑evaluates which request should run next, favoring those with tighter TTFT SLOs.
- Simulation & Trace Replay – They replayed real traffic logs from a production LLM service, injecting a mix of short and long prompts with varying latency budgets, and compared FlowPrefill against baseline chunked‑prefill and state‑of‑the‑art systems (e.g., vLLM, TGI).
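The scheduler logic described above can be illustrated with a tiny event‑driven loop. The event tuple layout and field names are assumptions for illustration, not the paper's actual design:

```python
import heapq

# Event-driven scheduling sketch: decisions are made only on "arrival"
# and "operator-complete" events, never on a fixed polling tick.
events = [
    (0.0, "arrival", "A", 10.0),   # (time, kind, request id, TTFT budget)
    (0.1, "arrival", "B", 0.5),    # tight-SLO request arrives later
    (0.2, "op_done", "A", None),   # A finishes an operator; re-evaluate
]

pending: list = []   # min-heap of (TTFT budget, request id)
decisions = []

for t, kind, rid, budget in sorted(events):
    if kind == "arrival":
        heapq.heappush(pending, (budget, rid))
    # On every event, pick the tightest-SLO pending request to run next.
    if pending:
        decisions.append((t, pending[0][1]))

print(decisions)   # → [(0.0, 'A'), (0.1, 'B'), (0.2, 'B')]
```

Note how `B` jumps ahead of `A` the moment it arrives: no polling interval bounds how long a high‑priority request can wait, because the arrival itself triggers the re‑evaluation.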
Results & Findings
| Metric | Baseline (Chunked Prefill) | FlowPrefill |
|---|---|---|
| Max goodput (relative to baseline) | 1.0× (reference) | 5.6× |
| 99th‑percentile TTFT SLO violations | 38 % | 7 % |
| Average GPU utilization | 68 % | 92 % |
| Scheduler overhead (CPU %) | 12 % | 3 % |
- Latency vs. Throughput Trade‑off Resolved: By preempting at operator boundaries, FlowPrefill keeps GPUs busy (high utilization) while still pulling high‑priority requests forward.
- Control‑Plane Efficiency: Event‑driven scheduling reduces scheduler wake‑ups by >80 %, freeing CPU cycles for other serving tasks.
- SLO Heterogeneity: The system automatically honors mixed latency budgets (e.g., interactive chat vs. batch summarization) without manual tuning.
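"Goodput" in these results counts only requests that meet their TTFT SLO, rather than raw throughput. A minimal way to compute it from request logs (field names and values are assumed for illustration) might be:

```python
# Goodput = throughput counting only requests whose TTFT met their SLO.
requests = [
    {"ttft": 0.3, "slo": 0.5},   # met
    {"ttft": 0.9, "slo": 0.5},   # violated
    {"ttft": 1.2, "slo": 2.0},   # met (looser batch-style budget)
]
window_s = 1.0   # measurement window in seconds

met = sum(r["ttft"] <= r["slo"] for r in requests)
goodput = met / window_s                 # SLO-attained requests per second
violation_rate = 1 - met / len(requests)
print(goodput, round(violation_rate, 2))   # → 2.0 0.33
```

Under this metric, finishing a request late is as bad as not finishing it at all, which is why reducing HoL blocking translates directly into higher goodput.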
Practical Implications
- For Cloud LLM Providers: Deploying FlowPrefill can increase the number of concurrent users per GPU, lowering cost per token and improving SLA compliance.
- For Application Developers: The API surface remains unchanged; developers only need to specify a TTFT deadline per request, and the runtime handles the rest.
- Edge / On‑Device Inference: The low‑overhead scheduler makes it feasible to run multi‑tenant LLM services on constrained hardware, from edge accelerators to small data‑center inference pods.
- Integration Path: FlowPrefill’s operator‑level hooks are implemented as thin wrappers around existing PyTorch/Transformers kernels, meaning it can be dropped into existing serving stacks (vLLM, TGI, Triton) with minimal code changes.
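The "thin wrapper" integration style mentioned above can be sketched framework‑agnostically. The hook registry and wrapper names below are hypothetical, chosen only to show the pattern of recording resumable state at each operator boundary without touching the kernel itself:

```python
from typing import Callable

# Hypothetical checkpoint registry: stores the output of each wrapped
# operator so the runtime could resume from the last completed boundary.
CHECKPOINTS: dict = {}

def with_boundary(name: str, op: Callable) -> Callable:
    """Wrap an operator so its boundary becomes a safe preemption point."""
    def wrapped(state):
        out = op(state)            # run the underlying kernel unmodified
        CHECKPOINTS[name] = out    # snapshot state at the operator boundary
        return out
    return wrapped

# Toy "transformer layer": attention then feed-forward, both wrapped.
attn = with_boundary("attn", lambda x: x + 1)
ffn  = with_boundary("ffn",  lambda x: x * 2)

hidden = ffn(attn(0))
print(hidden, CHECKPOINTS)   # → 2 {'attn': 1, 'ffn': 2}
```

Because the wrapper only intercepts inputs and outputs, the underlying PyTorch/Transformers kernels run unchanged, which is what makes a drop‑in integration with existing serving stacks plausible.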
Limitations & Future Work
- Model Compatibility: The current prototype targets standard transformer architectures; exotic models (e.g., mixture‑of‑experts, recurrent adapters) may require additional operator instrumentation.
- Memory Overhead: Checkpointing operator state incurs modest extra GPU memory (≈5 %); extremely large context windows could push memory limits.
- Distributed Scaling: The paper focuses on single‑node GPU scheduling. Extending the event‑driven preemption logic across multi‑node clusters (with network latency considerations) is left for future research.
- Adaptive Learning: Future versions could incorporate reinforcement‑learning‑based SLO prediction to further refine preemption decisions under highly volatile workloads.
Authors
- Chia-chi Hsieh
- Zan Zong
- Xinyang Chen
- Jianjiang Li
- Jidong Zhai
- Lijie Wen
Paper Information
- arXiv ID: 2602.16603v1
- Categories: cs.DC, cs.AI
- Published: February 18, 2026