[Paper] EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference

Published: January 29, 2026 at 09:14 AM EST
4 min read
Source: arXiv


Overview

Large Language Model (LLM) serving platforms must juggle two very different kinds of traffic: short, interactive queries that need instant responses, and long, batch‑style requests that are more tolerant of latency but demand high throughput. The paper EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed‑Workload LLM Inference proposes a new request‑level scheduler that learns the workload on‑the‑fly and dynamically routes requests to the most appropriate execution path. By doing so, it cuts tail latency for interactive queries and boosts overall hardware utilization.

Key Contributions

  • Refine‑and‑Prune partitioning – an unsupervised algorithm that automatically groups incoming requests into performance‑homogeneous clusters without any prior workload profiling.
  • Dynamic Queue Routing – a lightweight runtime component that assigns each request to the appropriate cluster based on its estimated “effective work”.
  • Density‑Weighted Scoring – a novel priority function that blends urgency (e.g., remaining tokens) with fairness, preventing starvation of long jobs while still favoring short ones.
  • Bayesian Meta‑Optimization – a closed‑loop tuner that continuously adjusts partitioning thresholds and scoring weights using live latency and throughput metrics.
  • Integration with vLLM – the authors embed EWSJF into the open‑source vLLM inference engine and demonstrate >30 % throughput gains and up to 4× lower time‑to‑first‑token (TTFT) for short requests versus vanilla FCFS.

Methodology

  1. Workload Observation – As requests arrive, the scheduler extracts simple features (token length, model version, request type) and monitors their execution latency.

  2. Unsupervised Grouping (Refine‑and‑Prune) – Using a clustering step (e.g., Gaussian Mixture Models) followed by a pruning phase, the system discovers “dense” regions where requests exhibit similar latency‑to‑work ratios. These become queues that share a common execution profile.

  3. Routing Logic – When a new request comes in, a lightweight classifier predicts which queue will give it the best service based on its current effective work estimate (tokens remaining ÷ expected throughput).

  4. Priority Scoring – Inside each queue, requests are ordered by a density‑weighted score:

    score = w_urgency / remaining_tokens + w_fairness × queue_density

    The weights are tuned so that very short queries jump to the front, while longer jobs still make progress.

  5. Bayesian Meta‑Optimization – A Bayesian optimizer treats the scoring weights and the clustering hyper‑parameters as latent variables. It periodically samples new configurations, runs a short evaluation on live traffic, and updates a posterior distribution to converge on the best settings.
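The Refine‑and‑Prune step (step 2 above) can be illustrated with a deliberately simplified sketch. The paper describes a clustering phase (e.g., GMMs) followed by pruning; the stand‑in below instead does a 1‑D gap split on latency‑to‑work ratios and folds undersized clusters into their neighbour. The function name, the `gap` and `min_size` thresholds, and the gap‑split heuristic are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical stand-in for Refine-and-Prune-style grouping.
# Refine: split sorted latency/work ratios wherever a large gap appears.
# Prune: merge clusters smaller than min_size into the previous cluster,
# so only "dense" regions survive as queues.

def refine_and_prune(ratios, gap=0.5, min_size=3):
    """Group 1-D latency-to-work ratios into dense clusters."""
    ratios = sorted(ratios)
    clusters = [[ratios[0]]]
    for r in ratios[1:]:
        if r - clusters[-1][-1] > gap:   # refine: open a new cluster
            clusters.append([r])
        else:
            clusters[-1].append(r)
    pruned = []
    for c in clusters:
        if pruned and len(c) < min_size:  # prune: fold sparse cluster back
            pruned[-1].extend(c)
        else:
            pruned.append(c)
    return pruned
```

Each surviving cluster would then back one scheduling queue with a shared execution profile.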

All components sit upstream of the low‑level GPU scheduler, meaning they can be dropped into any existing LLM serving stack without rewriting the kernel‑level dispatch logic.
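Steps 3 and 4 can be sketched together as plain request‑level logic, which is consistent with the components sitting upstream of the GPU scheduler. Only the score formula itself comes from the paper; the helper names, default weights, and the (lo, hi) queue‑range representation below are assumptions made for illustration.

```python
# Hypothetical sketch of routing (step 3) and priority scoring (step 4).

def effective_work(tokens_remaining, expected_throughput):
    """Estimated service time: tokens remaining / expected tokens-per-second."""
    return tokens_remaining / expected_throughput

def density_weighted_score(remaining_tokens, queue_density,
                           w_urgency=1.0, w_fairness=0.1):
    """score = w_urgency / remaining_tokens + w_fairness * queue_density.
    Few remaining tokens -> large urgency term, so short queries jump
    ahead; the density term raises crowded queues so long jobs still
    make progress instead of starving."""
    return w_urgency / remaining_tokens + w_fairness * queue_density

def route(request_work, queue_work_ranges):
    """Send a request to the first queue whose (lo, hi) effective-work
    range contains it; fall back to the last queue otherwise."""
    for i, (lo, hi) in enumerate(queue_work_ranges):
        if lo <= request_work < hi:
            return i
    return len(queue_work_ranges) - 1
```

With these defaults, an 8‑token request outscores a 512‑token one in the same queue, which is the short‑job preference the paper describes.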

Results & Findings

| Metric | FCFS (baseline) | EWSJF (paper) |
| --- | --- | --- |
| End‑to‑end throughput (tokens/s) | 1.00× | +30 % |
| Avg. TTFT for ≤ 64‑token queries | 120 ms | ≈ 30 ms (≈ 4× faster) |
| 99th‑percentile latency (interactive) | 500 ms | ≈ 180 ms |
| GPU utilization (average) | 68 % | ≈ 85 % |

Key takeaways

  • By separating short and long jobs into adaptive queues, the scheduler eliminates head‑of‑line blocking that plagues FCFS.
  • The Bayesian tuner quickly adapts to workload shifts (e.g., a sudden batch‑job surge) without manual re‑configuration.
  • Hardware is kept busier because long jobs can be packed together, while short jobs get immediate access to free compute slots.
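The closed‑loop tuning behaviour noted above can be sketched with a deliberately simplified loop. This stand‑in uses random search with a best‑so‑far update rather than a true Bayesian posterior (a real implementation would maintain, e.g., a Gaussian‑process surrogate); the `tune` function, its weight bounds, and the `evaluate` callback are all hypothetical.

```python
import random

# Simplified stand-in for the closed-loop meta-optimizer: repeatedly
# sample candidate (w_urgency, w_fairness) settings, score each with a
# short live-traffic evaluation, and keep the best configuration seen.

def tune(evaluate, rounds=20, seed=0):
    """evaluate(w_urgency, w_fairness) -> reward (higher is better)."""
    rng = random.Random(seed)
    best, best_reward = None, float("-inf")
    for _ in range(rounds):
        cand = (rng.uniform(0.1, 2.0), rng.uniform(0.0, 1.0))
        reward = evaluate(*cand)
        if reward > best_reward:
            best, best_reward = cand, reward
    return best
```

In practice `evaluate` would report a blend of live TTFT and throughput metrics, and re‑tuning would run periodically so the scheduler tracks workload shifts.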

Practical Implications

  • LLM SaaS providers can integrate EWSJF to meet strict SLAs for interactive chat while still maximizing batch‑processing revenue.
  • Edge inference platforms (e.g., on‑device assistants) benefit from lower TTFT, improving user experience without needing larger GPUs.
  • DevOps tooling: The Bayesian meta‑optimizer can be exposed as a simple API, allowing operators to set high‑level goals (e.g., “keep 99‑pct latency < 200 ms”) and let the system auto‑tune.
  • Cost efficiency: Higher GPU utilization translates directly into lower cloud‑compute bills for the same throughput, a compelling ROI argument for enterprises.
  • Open‑source adoption: Since the implementation lives on top of vLLM, teams already using that stack can drop in the scheduler with a few configuration changes, accelerating experimentation.

Limitations & Future Work

  • Model‑specific tuning – The current clustering assumes a relatively stable latency‑to‑token relationship; very heterogeneous models (e.g., mixture of encoder‑decoder and decoder‑only) may need separate partitioning strategies.
  • Cold‑start latency – The Refine‑and‑Prune step requires a brief observation window to form meaningful clusters; during sudden traffic spikes the scheduler may temporarily fall back to FCFS.
  • Scalability of Bayesian optimization – While lightweight for a single node, the meta‑optimizer could become a bottleneck in large multi‑node deployments; distributed Bayesian methods are a promising direction.
  • Fairness beyond latency – The paper focuses on latency fairness; future work could incorporate cost or priority tiers (e.g., paid vs. free users) into the scoring function.

Overall, EWSJF demonstrates that a modest, learning‑driven layer on top of existing inference engines can unlock substantial performance gains for mixed‑workload LLM serving—a win for both developers and end‑users.

Authors

  • Bronislav Sidik
  • Chaya Levi
  • Joseph Kampeas

Paper Information

  • arXiv ID: 2601.21758v1
  • Categories: cs.DC, cs.AI
  • Published: January 29, 2026