[Paper] FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
Source: arXiv - 2602.16603v1
Overview
Large language models (LLMs) are being served at massive scale, and the “prefill” phase—where the model processes the initial prompt—often becomes a bottleneck. Long‑running prompts can block shorter, higher‑priority requests, causing time‑to‑first‑token (TTFT) violations. The paper FlowPrefill proposes a new serving architecture that separates when a request can be preempted from how finely the work is chunked, dramatically reducing head‑of‑line (HoL) blocking while keeping throughput high.
Key Contributions
- Operator‑Level Preemption: Uses natural boundaries between model operators (e.g., attention, feed‑forward) to pause and resume a request without resorting to tiny, inefficient chunks.
- Event‑Driven Scheduling: Scheduling decisions are made only on request arrivals or completions, eliminating the constant polling overhead of traditional schedulers.
- TTFT‑Goodput Optimizer: A lightweight runtime that dynamically balances latency (TTFT) against overall goodput, respecting heterogeneous SLOs across requests.
- Real‑World Evaluation: Experiments on production‑grade traces show up to 5.6× improvement in maximum goodput compared with the best existing LLM serving systems, while meeting diverse latency targets.
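The operator‑level preemption idea above can be sketched in a few lines. Everything here (the request structure, the operator loop, the preemption check) is an illustrative assumption for exposition, not the paper's actual implementation:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    # Hypothetical request record; ordering by TTFT deadline so the
    # heap pops the tightest-SLO request first.
    deadline: float
    name: str = field(compare=False)
    ops_done: int = field(default=0, compare=False)
    num_ops: int = field(default=4, compare=False)

def run_prefill(req: Request, ready: list, arrivals: list) -> bool:
    """Run operators until done, or yield if a tighter-SLO request arrives.

    The preemption check happens only BETWEEN operators -- never
    mid-kernel -- so no compute inside an operator is wasted.
    """
    while req.ops_done < req.num_ops:
        req.ops_done += 1                      # execute one operator (stub)
        if arrivals and arrivals[0].deadline < req.deadline:
            heapq.heappush(ready, req)         # checkpoint at operator boundary
            return False                       # preempted, state preserved
    return True                                # prefill finished

ready: list = []
heapq.heappush(ready, Request(deadline=5.0, name="long-batch"))
arrivals = [Request(deadline=0.5, name="chat")]   # interactive request arrives

req = heapq.heappop(ready)
done = run_prefill(req, ready, arrivals)
print(done, req.ops_done)   # → False 1 (long-batch yields after one operator)
```

Because the checkpointed request keeps its partial progress (`ops_done`), resuming it later costs nothing beyond the scheduler push/pop, which is the key difference from token‑level chunking.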
Methodology
- Decoupling Granularity from Frequency – Instead of fixing a chunk size (e.g., “process 10 tokens then check for preemption”), FlowPrefill lets the scheduler decide when to intervene (event‑driven) and where to intervene (operator boundaries).
- Operator‑Level Hooks – The authors instrument the transformer implementation so that each operator can be safely checkpointed and resumed. This yields fine‑grained preemption without the compute waste of small token‑level chunks.
- Scheduler Logic – A central controller maintains a priority queue of pending requests. When a new request arrives or an existing one finishes an operator, the scheduler re‑evaluates which request should run next, favoring those with tighter TTFT SLOs.
- Simulation & Trace Replay – They replayed real traffic logs from a production LLM service, injecting a mix of short and long prompts with varying latency budgets, and compared FlowPrefill against baseline chunked‑prefill and state‑of‑the‑art systems (e.g., vLLM, TGI).
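The scheduler logic described above can be illustrated with a tiny event‑driven loop. The event tuple layout and field names are assumptions for illustration, not the paper's actual design:

```python
import heapq

# Event-driven scheduling sketch: decisions are made only on "arrival"
# and "operator-complete" events, never on a fixed polling tick.
events = [
    (0.0, "arrival", "A", 10.0),   # (time, kind, request id, TTFT budget)
    (0.1, "arrival", "B", 0.5),    # tight-SLO request arrives later
    (0.2, "op_done", "A", None),   # A finishes an operator; re-evaluate
]

pending: list = []   # min-heap of (TTFT budget, request id)
decisions = []

for t, kind, rid, budget in sorted(events):
    if kind == "arrival":
        heapq.heappush(pending, (budget, rid))
    # On every event, pick the tightest-SLO pending request to run next.
    if pending:
        decisions.append((t, pending[0][1]))

print(decisions)   # → [(0.0, 'A'), (0.1, 'B'), (0.2, 'B')]
```

Note how `B` jumps ahead of `A` the moment it arrives: no polling interval bounds how long a high‑priority request can wait, because the arrival itself triggers the re‑evaluation.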
Results & Findings
| Metric | Baseline (Chunked Prefill) | FlowPrefill |
|---|---|---|
| Max goodput (relative to baseline) | 1.0× (reference) | 5.6× |
| 99th‑percentile TTFT SLO violations | 38 % | 7 % |
| Average GPU utilization | 68 % | 92 % |
| Scheduler overhead (CPU %) | 12 % | 3 % |
- Latency vs. Throughput Trade‑off Resolved: By preempting at operator boundaries, FlowPrefill keeps GPUs busy (high utilization) while still pulling high‑priority requests forward.
- Control‑Plane Efficiency: Event‑driven scheduling reduces scheduler wake‑ups by >80 %, freeing CPU cycles for other serving tasks.
- SLO Heterogeneity: The system automatically honors mixed latency budgets (e.g., interactive chat vs. batch summarization) without manual tuning.
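"Goodput" in these results counts only requests that meet their TTFT SLO, rather than raw throughput. A minimal way to compute it from request logs (field names and values are assumed for illustration) might be:

```python
# Goodput = throughput counting only requests whose TTFT met their SLO.
requests = [
    {"ttft": 0.3, "slo": 0.5},   # met
    {"ttft": 0.9, "slo": 0.5},   # violated
    {"ttft": 1.2, "slo": 2.0},   # met (looser batch-style budget)
]
window_s = 1.0   # measurement window in seconds

met = sum(r["ttft"] <= r["slo"] for r in requests)
goodput = met / window_s                 # SLO-attained requests per second
violation_rate = 1 - met / len(requests)
print(goodput, round(violation_rate, 2))   # → 2.0 0.33
```

Under this metric, finishing a request late is as bad as not finishing it at all, which is why reducing HoL blocking translates directly into higher goodput.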
Practical Implications
- For Cloud LLM Providers: Deploying FlowPrefill can increase the number of concurrent users per GPU, lowering cost per token and improving SLA compliance.
- For Application Developers: The API surface remains unchanged; developers only need to specify a TTFT deadline per request, and the runtime handles the rest.
- Edge / On‑Device Inference: The low‑overhead scheduler makes it feasible to run multi‑tenant LLM services on constrained hardware, from edge accelerators to small data‑center inference pods.
- Integration Path: FlowPrefill’s operator‑level hooks are implemented as thin wrappers around existing PyTorch/Transformers kernels, meaning it can be dropped into existing serving stacks (vLLM, TGI, Triton) with minimal code changes.
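The "thin wrapper" integration style mentioned above can be sketched framework‑agnostically. The hook registry and wrapper names below are hypothetical, chosen only to show the pattern of recording resumable state at each operator boundary without touching the kernel itself:

```python
from typing import Callable

# Hypothetical checkpoint registry: stores the output of each wrapped
# operator so the runtime could resume from the last completed boundary.
CHECKPOINTS: dict = {}

def with_boundary(name: str, op: Callable) -> Callable:
    """Wrap an operator so its boundary becomes a safe preemption point."""
    def wrapped(state):
        out = op(state)            # run the underlying kernel unmodified
        CHECKPOINTS[name] = out    # snapshot state at the operator boundary
        return out
    return wrapped

# Toy "transformer layer": attention then feed-forward, both wrapped.
attn = with_boundary("attn", lambda x: x + 1)
ffn  = with_boundary("ffn",  lambda x: x * 2)

hidden = ffn(attn(0))
print(hidden, CHECKPOINTS)   # → 2 {'attn': 1, 'ffn': 2}
```

Because the wrapper only intercepts inputs and outputs, the underlying PyTorch/Transformers kernels run unchanged, which is what makes a drop‑in integration with existing serving stacks plausible.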
Limitations & Future Work
- Model Compatibility: The current prototype targets standard transformer architectures; exotic models (e.g., mixture‑of‑experts, recurrent adapters) may require additional operator instrumentation.
- Memory Overhead: Checkpointing operator state incurs modest extra GPU memory (≈5 %); extremely large context windows could push memory limits.
- Distributed Scaling: The paper focuses on single‑node GPU scheduling. Extending the event‑driven preemption logic across multi‑node clusters (with network latency considerations) is left for future research.
- Adaptive Learning: Future versions could incorporate reinforcement‑learning‑based SLO prediction to further refine preemption decisions under highly volatile workloads.
Authors
- Chia-chi Hsieh
- Zan Zong
- Xinyang Chen
- Jianjiang Li
- Jidong Zhai
- Lijie Wen
Paper Information
- arXiv ID: 2602.16603v1
- Categories: cs.DC, cs.AI
- Published: February 18, 2026