[Paper] FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

Published: February 18, 2026
4 min read
Source: arXiv (2602.16603v1)

Overview

Large language models (LLMs) are served at massive scale, and the "prefill" phase, where the model processes the initial prompt, often becomes a bottleneck. Long-running prompts can block shorter, higher-priority requests queued behind them, violating time-to-first-token (TTFT) service-level objectives (SLOs). The paper FlowPrefill proposes a serving architecture that decouples when a request can be preempted from how finely its prefill work is chunked, sharply reducing head-of-line (HoL) blocking while keeping throughput high.

Key Contributions

  • Operator‑Level Preemption: Uses natural boundaries between model operators (e.g., attention, feed‑forward) to pause and resume a request without resorting to tiny, inefficient chunks.
  • Event‑Driven Scheduling: Scheduling decisions are made only on request arrivals or completions, eliminating the constant polling overhead of traditional schedulers.
  • TTFT‑Goodput Optimizer: A lightweight runtime that dynamically balances latency (TTFT) against overall goodput, respecting heterogeneous SLOs across requests.
  • Real‑World Evaluation: Experiments on production‑grade traces show up to 5.6× improvement in maximum goodput compared with the best existing LLM serving systems, while meeting diverse latency targets.
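The operator-level preemption idea can be sketched in a few lines. The sketch below is a toy illustration, not the paper's code (all class and function names are assumptions): a request runs each operator at full speed, checks a preemption flag only at the boundary between operators, and records a resume point instead of splitting work into tiny token-level chunks.

```python
# Illustrative sketch of operator-boundary preemption (hypothetical names).
# Operators run uninterrupted; preemption is only possible between them.

class Preempted(Exception):
    """Raised when the scheduler pauses a request at an operator boundary."""

class PreemptibleRequest:
    def __init__(self, operators, x):
        self.operators = operators  # list of (name, fn) pairs, e.g. attention/FFN
        self.state = x              # intermediate activations (toy: a number)
        self.next_op = 0            # resume point: index of next operator to run

    def run(self, should_preempt):
        # Execute operators starting from the saved resume point; between
        # operators, ask the scheduler whether a tighter-SLO request is waiting.
        while self.next_op < len(self.operators):
            name, fn = self.operators[self.next_op]
            self.state = fn(self.state)   # full operator runs without chunking
            self.next_op += 1
            if self.next_op < len(self.operators) and should_preempt():
                raise Preempted(name)     # pause here; state is already saved
        return self.state

ops = [("attention", lambda x: x * 2), ("ffn", lambda x: x + 3)]
req = PreemptibleRequest(ops, 5)
try:
    req.run(should_preempt=lambda: True)   # scheduler demands preemption
except Preempted:
    pass
print(req.state)               # 10: attention (5*2) ran, ffn not yet
print(req.run(lambda: False))  # 13: resumes at ffn (10+3)
```

Because the checkpoint falls on a natural operator boundary, no partially computed kernel output has to be discarded or redone, which is where chunked prefill loses efficiency.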

Methodology

  1. Decoupling Granularity from Frequency – Instead of fixing a chunk size (e.g., “process 10 tokens then check for preemption”), FlowPrefill lets the scheduler decide when to intervene (event‑driven) and where to intervene (operator boundaries).
  2. Operator‑Level Hooks – The authors instrument the transformer implementation so that each operator can be safely checkpointed and resumed. This yields fine‑grained preemption without the compute waste of small token‑level chunks.
  3. Scheduler Logic – A central controller maintains a priority queue of pending requests. When a new request arrives or an existing one finishes an operator, the scheduler re‑evaluates which request should run next, favoring those with tighter TTFT SLOs.
  4. Simulation & Trace Replay – They replayed real traffic logs from a production LLM service, injecting a mix of short and long prompts with varying latency budgets, and compared FlowPrefill against baseline chunked‑prefill and state‑of‑the‑art systems (e.g., vLLM, TGI).
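Steps 1 and 3 above can be sketched as a small event-driven controller. Everything below is a hypothetical illustration of the described logic, not the paper's implementation: scheduling decisions fire only inside event handlers (request arrival, operator completion), and a min-heap keyed by TTFT deadline always dispatches the request with the tightest SLO.

```python
# Hypothetical sketch of an event-driven prefill scheduler: no polling loop,
# decisions happen only when an arrival or operator-completion event fires.
import heapq

class EventDrivenScheduler:
    def __init__(self):
        self.queue = []  # min-heap of (ttft_deadline, request_id); tighter first

    def on_arrival(self, request_id, ttft_deadline):
        # Event: a new request arrives -> enqueue it and re-evaluate.
        heapq.heappush(self.queue, (ttft_deadline, request_id))
        return self.pick_next()

    def on_operator_done(self, request_id, ttft_deadline, finished):
        # Event: the running request crossed an operator boundary.
        if not finished:
            heapq.heappush(self.queue, (ttft_deadline, request_id))
        return self.pick_next()

    def pick_next(self):
        # Called only from event handlers, so the scheduler sleeps otherwise.
        if self.queue:
            return heapq.heappop(self.queue)[1]
        return None

sched = EventDrivenScheduler()
sched.on_arrival("long-batch", ttft_deadline=5.0)  # long prompt, loose SLO
nxt = sched.on_arrival("chat", ttft_deadline=0.2)  # short prompt, tight SLO
print(nxt)  # "chat" jumps ahead of the long-running request
```

In a full system the preempted request would re-enter the queue through `on_operator_done`, so the long prompt makes progress whenever no tighter-deadline work is pending.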

Results & Findings

| Metric | Baseline (chunked prefill) | FlowPrefill |
| --- | --- | --- |
| Max goodput (requests/s) | 1.0× (reference) | 5.6× |
| 99th-percentile TTFT SLO violations | 38 % | 7 % |
| Average GPU utilization | 68 % | 92 % |
| Scheduler overhead (CPU %) | 12 % | 3 % |

  • Latency vs. Throughput Trade‑off Resolved: By preempting at operator boundaries, FlowPrefill keeps GPUs busy (high utilization) while still pulling high‑priority requests forward.
  • Control‑Plane Efficiency: Event‑driven scheduling reduces scheduler wake‑ups by >80 %, freeing CPU cycles for other serving tasks.
  • SLO Heterogeneity: The system automatically honors mixed latency budgets (e.g., interactive chat vs. batch summarization) without manual tuning.

Practical Implications

  • For Cloud LLM Providers: Deploying FlowPrefill can increase the number of concurrent users per GPU, lowering cost per token and improving SLA compliance.
  • For Application Developers: The API surface remains unchanged; developers only need to specify a TTFT deadline per request, and the runtime handles the rest.
  • Edge / On‑Device Inference: The low‑overhead scheduler makes multi‑tenant LLM serving feasible on constrained hardware, such as a single shared inference accelerator.
  • Integration Path: FlowPrefill’s operator‑level hooks are implemented as thin wrappers around existing PyTorch/Transformers kernels, meaning it can be dropped into existing serving stacks (vLLM, TGI, Triton) with minimal code changes.
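To illustrate the "unchanged API surface" point above, a developer-facing request might carry only one extra field: the TTFT deadline. The class and field names below are assumptions for illustration, not FlowPrefill's actual API.

```python
# Hypothetical request envelope: the developer supplies a prompt plus a
# TTFT deadline; the runtime handles preemption and ordering internally.
from dataclasses import dataclass

@dataclass
class ServingRequest:
    prompt: str
    ttft_deadline_s: float  # the only extra field the developer provides

interactive = ServingRequest("Hi! Help me draft a reply...", ttft_deadline_s=0.2)
batch = ServingRequest("Summarize these 50 documents...", ttft_deadline_s=30.0)

# The runtime would favor tighter deadlines when ordering prefill work:
ordered = sorted([batch, interactive], key=lambda r: r.ttft_deadline_s)
print([r.ttft_deadline_s for r in ordered])  # [0.2, 30.0]
```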

Limitations & Future Work

  • Model Compatibility: The current prototype targets standard transformer architectures; exotic models (e.g., mixture‑of‑experts, recurrent adapters) may require additional operator instrumentation.
  • Memory Overhead: Checkpointing operator state incurs modest extra GPU memory (≈5 %); extremely large context windows could push memory limits.
  • Distributed Scaling: The paper focuses on single‑node GPU scheduling. Extending the event‑driven preemption logic across multi‑node clusters (with network latency considerations) is left for future research.
  • Adaptive Learning: Future versions could incorporate reinforcement‑learning‑based SLO prediction to further refine preemption decisions under highly volatile workloads.

Authors

  • Chia-chi Hsieh
  • Zan Zong
  • Xinyang Chen
  • Jianjiang Li
  • Jidong Zhai
  • Lijie Wen

Paper Information

  • arXiv ID: 2602.16603v1
  • Categories: cs.DC, cs.AI
  • Published: February 18, 2026