[Paper] SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity

Published: March 8, 2026 at 11:20 PM EDT
4 min read
Source: arXiv - 2603.07917v1

Overview

Large Language Model (LLM) inference is becoming a backbone service for everything from chat assistants to code generators. Yet serving these models efficiently is hard: each request’s output length is unknown until the request finishes, and the workload stresses both GPU compute and memory. The paper “SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity” introduces a scheduler that predicts output length, models the true cost of a request, and makes uncertainty‑aware placement decisions, delivering an average efficiency gain of 28.7 % over state‑of‑the‑art heuristics.

Key Contributions

  • Lightweight output‑length predictor – combines features of the prompt with recent inference results to estimate a probability distribution over the final token count.
  • Hybrid cost model – quantifies an inference request’s true “service cost” by jointly accounting for compute cycles and memory pressure.
  • Uncertainty‑aware scheduling policy – uses the predicted length distribution to allocate requests to GPUs in a way that maximizes throughput while respecting memory limits.
  • Comprehensive evaluation – real‑world testbeds (different GPU clusters, batch sizes, and request mixes) show an average efficiency gain of 28.7 % over state‑of‑the‑art heuristics.

Methodology

  1. Data‑driven length prediction

    • For each incoming request, SageSched extracts features from the prompt (e.g., token count, vocabulary patterns) and from the most recent completed inferences on the same model.
    • A lightweight regression model (e.g., a shallow neural net) outputs a probability distribution over possible output lengths rather than a single point estimate.
  2. Hybrid cost estimation

    • The scheduler computes two components:
      • Compute cost – estimated FLOPs based on the predicted token count.
      • Memory cost – estimated GPU memory footprint, which grows with both prompt and output length because the KV cache must retain key/value tensors for every token processed so far.
    • The total cost is a weighted sum that reflects the actual bottleneck on a given hardware configuration.
  3. Uncertainty‑aware placement

    • Instead of assigning a request to the “most free” GPU, SageSched evaluates the expected marginal utility of placing the request on each GPU, integrating over the length distribution.
    • It selects the GPU that minimizes the expected increase in overall system latency while keeping memory usage below a safety threshold.
    • The policy runs in O(N) time per request (N = number of GPUs), making it suitable for high‑throughput serving stacks.
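The three stages above can be sketched end to end. This is a minimal illustration, assuming a discretized length distribution; the bucket boundaries, per‑token cost constants, weights, and memory threshold are placeholders, not the paper’s actual parameters:

```python
import math

# Candidate output lengths (tokens); SageSched's real buckets are unknown.
LENGTH_BUCKETS = [64, 256, 1024]

def predict_length_dist(prompt_tokens, recent_lengths):
    """Toy stand-in for the lightweight predictor: bias the distribution
    toward the mean of recently observed output lengths."""
    mean_recent = sum(recent_lengths) / len(recent_lengths)
    weights = [math.exp(-abs(b - mean_recent) / mean_recent) for b in LENGTH_BUCKETS]
    total = sum(weights)
    return [w / total for w in weights]   # probability per bucket

def hybrid_cost(prompt_tokens, out_tokens, w_compute=1.0, w_memory=0.5):
    """Weighted sum of compute and memory cost (the paper's hybrid model);
    the per-token proxies and weights here are illustrative."""
    compute = (prompt_tokens + out_tokens) * 2.0   # FLOPs proxy
    memory = (prompt_tokens + out_tokens) * 1.0    # KV-cache proxy
    return w_compute * compute + w_memory * memory

def place(request, gpus, mem_limit=1000.0):
    """Uncertainty-aware placement: pick the GPU minimizing expected added
    load while respecting the memory safety threshold."""
    dist = predict_length_dist(request["prompt_tokens"], request["recent"])
    exp_cost = sum(p * hybrid_cost(request["prompt_tokens"], b)
                   for p, b in zip(dist, LENGTH_BUCKETS))
    exp_mem = sum(p * (request["prompt_tokens"] + b)
                  for p, b in zip(dist, LENGTH_BUCKETS))
    best = None
    for gpu in gpus:                       # a single pass over the N GPUs
        if gpu["mem_used"] + exp_mem > mem_limit:
            continue                       # would breach the safety threshold
        load_after = gpu["load"] + exp_cost
        if best is None or load_after < best[1]:
            best = (gpu["name"], load_after)
    return best

req = {"prompt_tokens": 100, "recent": [200, 300, 250]}
gpus = [{"name": "gpu0", "load": 500.0, "mem_used": 900.0},
        {"name": "gpu1", "load": 800.0, "mem_used": 100.0}]
# gpu0 breaches the memory threshold here, so the less-loaded-looking
# GPU is skipped and gpu1 is chosen.
print(place(req, gpus))
```

Because each request needs only one pass over the GPU list, the sketch matches the O(N) per‑request complexity the paper reports.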

Results & Findings

Metric                     Baseline (heuristic)   SageSched     Improvement
Throughput (req/s)         1,200                  1,540         +28.3 %
Average latency (ms)       210                    165           –21 %
GPU memory utilization     92 % (peak)            78 % (peak)   –15 %
GPU compute utilization    84 %                   92 %          +9 %
  • The gains hold across different model sizes (7B‑30B parameters) and heterogeneous clusters (A100, H100).
  • When the workload includes a mix of short and long generations, SageSched’s uncertainty‑aware decisions prevent “memory‑starvation” scenarios that cripple naive schedulers.
  • Ablation studies show that removing either the length predictor or the hybrid cost model drops the efficiency gain to ~10 %, confirming that both components are essential.

Practical Implications

  • Cloud AI providers can pack more inference requests per GPU, reducing hardware spend or enabling lower pricing for end‑users.
  • DevOps teams gain a deterministic way to size clusters: the scheduler’s cost model can be fed into capacity‑planning tools, avoiding over‑provisioning.
  • Application developers (e.g., chatbot platforms) see fewer latency spikes because the scheduler proactively reserves memory for long‑tail generations.
  • Edge or on‑prem deployments with limited GPU memory benefit especially from the memory‑aware aspect, allowing larger models to run on the same hardware.
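The capacity‑planning point can be made concrete with back‑of‑envelope arithmetic. The 28.3 % multiplier comes from the reported throughput gain; the per‑GPU baseline throughput is an assumed figure for illustration:

```python
import math

def gpus_needed(target_rps, per_gpu_rps):
    """GPUs required to sustain a target request rate."""
    return math.ceil(target_rps / per_gpu_rps)

baseline = 150.0                 # assumed per-GPU throughput, req/s
improved = baseline * 1.283      # with the reported +28.3 % scheduler gain

for target in (1000, 5000, 20000):
    print(f"{target} req/s: {gpus_needed(target, baseline)} GPUs baseline, "
          f"{gpus_needed(target, improved)} with the improved scheduler")
```

Under these assumptions, a 5,000 req/s target drops from 34 GPUs to 26, which is the kind of delta a capacity‑planning tool could surface automatically.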

Integrating SageSched into existing inference‑serving stacks (e.g., TensorRT‑LLM, vLLM, or NVIDIA’s Triton Inference Server) would mainly require plugging in the lightweight predictor and swapping the request‑placement logic; no major architectural overhaul is needed.
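As a sketch of what that integration surface might look like, assuming the serving stack exposes a pluggable placement hook (the `Scheduler` class and `PlacementFn` signature here are hypothetical, not a real vLLM or TensorRT‑LLM API):

```python
from typing import Callable, Dict, List

# A placement function maps (request, gpu states) -> chosen GPU name.
PlacementFn = Callable[[Dict, List[Dict]], str]

class Scheduler:
    """Hypothetical serving-stack scheduler with a swappable placement hook."""

    def __init__(self, place_fn: PlacementFn):
        self.place_fn = place_fn   # swap a heuristic for SageSched-style logic here

    def admit(self, request: Dict, gpus: List[Dict]) -> str:
        return self.place_fn(request, gpus)

def least_loaded(request: Dict, gpus: List[Dict]) -> str:
    """Baseline heuristic: ignore length uncertainty, pick the freest GPU."""
    return min(gpus, key=lambda g: g["load"])["name"]

sched = Scheduler(place_fn=least_loaded)
gpus = [{"name": "gpu0", "load": 0.7}, {"name": "gpu1", "load": 0.3}]
print(sched.admit({"prompt": "hello"}, gpus))   # -> gpu1
```

Replacing `least_loaded` with an uncertainty‑aware function is then a one‑line change, which is what makes the "no architectural overhaul" claim plausible.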

Limitations & Future Work

  • The current predictor is trained on historical request logs from a single model; cross‑model generalization may need additional fine‑tuning.
  • SageSched assumes static GPU pools; dynamic scaling (adding/removing nodes) isn’t explored.
  • The cost model treats compute and memory as additive; more complex interactions (e.g., bandwidth contention) could be modeled for even finer scheduling.
  • Future research directions include extending the framework to multi‑tenant environments, incorporating energy‑aware scheduling, and exploring reinforcement‑learning policies that adapt online to workload shifts.

Authors

  • Zhenghao Gan
  • Yichen Bao
  • Yifei Liu
  • Chen Chen
  • Quan Chen
  • Minyi Guo

Paper Information

  • arXiv ID: 2603.07917v1
  • Categories: cs.DC
  • Published: March 9, 2026