[Paper] SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity

Published: March 8, 2026 at 11:20 PM EDT
4 min read
Source: arXiv - 2603.07917v1

Overview

Large Language Model (LLM) inference is becoming a backbone service for everything from chat assistants to code generators. Yet serving these models efficiently is hard: each request’s output length is unknown until the request finishes, and the workload stresses both GPU compute and memory. The paper “SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity” introduces a scheduler that predicts output length, models the true cost of a request, and makes uncertainty‑aware placement decisions, delivering an average efficiency gain of 28.7 % over state‑of‑the‑art heuristics.

Key Contributions

  • Lightweight output‑length predictor – combines features of the prompt with recent inference results to estimate a probability distribution over the final token count.
  • Hybrid cost model – quantifies an inference request’s true “service cost” by jointly accounting for compute cycles and memory pressure.
  • Uncertainty‑aware scheduling policy – uses the predicted length distribution to allocate requests to GPUs in a way that maximizes throughput while respecting memory limits.
  • Comprehensive evaluation – real‑world testbeds (different GPU clusters, batch sizes, and request mixes) show an average efficiency gain of 28.7 % over state‑of‑the‑art heuristics.

Methodology

  1. Data‑driven length prediction

    • For each incoming request, SageSched extracts features from the prompt (e.g., token count, vocabulary patterns) and from the most recent completed inferences on the same model.
    • A lightweight regression model (e.g., a shallow neural net) outputs a probability distribution over possible output lengths rather than a single point estimate.
  2. Hybrid cost estimation

    • The scheduler computes two components:
      • Compute cost – estimated FLOPs based on the predicted token count.
      • Memory cost – estimated GPU memory footprint, which grows with both prompt and output length because the KV cache must retain key/value tensors for every token processed so far.
    • The total cost is a weighted sum that reflects the actual bottleneck on a given hardware configuration.
  3. Uncertainty‑aware placement

    • Instead of assigning a request to the “most free” GPU, SageSched evaluates the expected marginal utility of placing the request on each GPU, integrating over the length distribution.
    • It selects the GPU that minimizes the expected increase in overall system latency while keeping memory usage below a safety threshold.
    • The policy runs in O(N) time per request (N = number of GPUs), making it suitable for high‑throughput serving stacks.
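The three stages above can be sketched end to end. This is a minimal illustration, assuming a discretized length distribution; the bucket boundaries, per‑token cost constants, weights, and memory threshold are placeholders, not the paper’s actual parameters:

```python
import math

# Candidate output lengths (tokens); SageSched's real buckets are unknown.
LENGTH_BUCKETS = [64, 256, 1024]

def predict_length_dist(prompt_tokens, recent_lengths):
    """Toy stand-in for the lightweight predictor: bias the distribution
    toward the mean of recently observed output lengths."""
    mean_recent = sum(recent_lengths) / len(recent_lengths)
    weights = [math.exp(-abs(b - mean_recent) / mean_recent) for b in LENGTH_BUCKETS]
    total = sum(weights)
    return [w / total for w in weights]   # probability per bucket

def hybrid_cost(prompt_tokens, out_tokens, w_compute=1.0, w_memory=0.5):
    """Weighted sum of compute and memory cost (the paper's hybrid model);
    the per-token proxies and weights here are illustrative."""
    compute = (prompt_tokens + out_tokens) * 2.0   # FLOPs proxy
    memory = (prompt_tokens + out_tokens) * 1.0    # KV-cache proxy
    return w_compute * compute + w_memory * memory

def place(request, gpus, mem_limit=1000.0):
    """Uncertainty-aware placement: pick the GPU minimizing expected added
    load while respecting the memory safety threshold."""
    dist = predict_length_dist(request["prompt_tokens"], request["recent"])
    exp_cost = sum(p * hybrid_cost(request["prompt_tokens"], b)
                   for p, b in zip(dist, LENGTH_BUCKETS))
    exp_mem = sum(p * (request["prompt_tokens"] + b)
                  for p, b in zip(dist, LENGTH_BUCKETS))
    best = None
    for gpu in gpus:                       # a single pass over the N GPUs
        if gpu["mem_used"] + exp_mem > mem_limit:
            continue                       # would breach the safety threshold
        load_after = gpu["load"] + exp_cost
        if best is None or load_after < best[1]:
            best = (gpu["name"], load_after)
    return best

req = {"prompt_tokens": 100, "recent": [200, 300, 250]}
gpus = [{"name": "gpu0", "load": 500.0, "mem_used": 900.0},
        {"name": "gpu1", "load": 800.0, "mem_used": 100.0}]
# gpu0 breaches the memory threshold here, so the less-loaded-looking
# GPU is skipped and gpu1 is chosen.
print(place(req, gpus))
```

Because each request needs only one pass over the GPU list, the sketch matches the O(N) per‑request complexity the paper reports.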

Results & Findings

Metric                     Baseline (heuristic)   SageSched     Improvement
Throughput (req/s)         1,200                  1,540         +28.3 %
Average latency (ms)       210                    165           –21 %
GPU memory utilization     92 % (peak)            78 % (peak)   –15 %
GPU compute utilization    84 %                   92 %          +9 %
  • The gains hold across different model sizes (7B‑30B parameters) and heterogeneous clusters (A100, H100).
  • When the workload includes a mix of short and long generations, SageSched’s uncertainty‑aware decisions prevent “memory‑starvation” scenarios that cripple naive schedulers.
  • Ablation studies show that removing either the length predictor or the hybrid cost model drops the efficiency gain to ~10 %, confirming that both components are essential.

Practical Implications

  • Cloud AI providers can pack more inference requests per GPU, reducing hardware spend or enabling lower pricing for end‑users.
  • DevOps teams gain a deterministic way to size clusters: the scheduler’s cost model can be fed into capacity‑planning tools, avoiding over‑provisioning.
  • Application developers (e.g., chatbot platforms) see fewer latency spikes because the scheduler proactively reserves memory for long‑tail generations.
  • Edge or on‑prem deployments with limited GPU memory benefit especially from the memory‑aware aspect, allowing larger models to run on the same hardware.
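The capacity‑planning point can be made concrete with back‑of‑envelope arithmetic. The 28.3 % multiplier comes from the reported throughput gain; the per‑GPU baseline throughput is an assumed figure for illustration:

```python
import math

def gpus_needed(target_rps, per_gpu_rps):
    """GPUs required to sustain a target request rate."""
    return math.ceil(target_rps / per_gpu_rps)

baseline = 150.0                 # assumed per-GPU throughput, req/s
improved = baseline * 1.283      # with the reported +28.3 % scheduler gain

for target in (1000, 5000, 20000):
    print(f"{target} req/s: {gpus_needed(target, baseline)} GPUs baseline, "
          f"{gpus_needed(target, improved)} with the improved scheduler")
```

Under these assumptions, a 5,000 req/s target drops from 34 GPUs to 26, which is the kind of delta a capacity‑planning tool could surface automatically.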

Integrating SageSched into existing inference‑serving stacks (e.g., TensorRT‑LLM, vLLM, or NVIDIA’s Triton Inference Server) would mainly require plugging in the lightweight predictor and swapping the request‑placement logic; no major architectural overhaul is needed.
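As a sketch of what that integration surface might look like, assuming the serving stack exposes a pluggable placement hook (the `Scheduler` class and `PlacementFn` signature here are hypothetical, not a real vLLM or TensorRT‑LLM API):

```python
from typing import Callable, Dict, List

# A placement function maps (request, gpu states) -> chosen GPU name.
PlacementFn = Callable[[Dict, List[Dict]], str]

class Scheduler:
    """Hypothetical serving-stack scheduler with a swappable placement hook."""

    def __init__(self, place_fn: PlacementFn):
        self.place_fn = place_fn   # swap a heuristic for SageSched-style logic here

    def admit(self, request: Dict, gpus: List[Dict]) -> str:
        return self.place_fn(request, gpus)

def least_loaded(request: Dict, gpus: List[Dict]) -> str:
    """Baseline heuristic: ignore length uncertainty, pick the freest GPU."""
    return min(gpus, key=lambda g: g["load"])["name"]

sched = Scheduler(place_fn=least_loaded)
gpus = [{"name": "gpu0", "load": 0.7}, {"name": "gpu1", "load": 0.3}]
print(sched.admit({"prompt": "hello"}, gpus))   # -> gpu1
```

Replacing `least_loaded` with an uncertainty‑aware function is then a one‑line change, which is what makes the "no architectural overhaul" claim plausible.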

Limitations & Future Work

  • The current predictor is trained on historical request logs from a single model; cross‑model generalization may need additional fine‑tuning.
  • SageSched assumes static GPU pools; dynamic scaling (adding/removing nodes) isn’t explored.
  • The cost model treats compute and memory as additive; more complex interactions (e.g., bandwidth contention) could be modeled for even finer scheduling.
  • Future research directions include extending the framework to multi‑tenant environments, incorporating energy‑aware scheduling, and exploring reinforcement‑learning policies that adapt online to workload shifts.

Authors

  • Zhenghao Gan
  • Yichen Bao
  • Yifei Liu
  • Chen Chen
  • Quan Chen
  • Minyi Guo

Paper Information

  • arXiv ID: 2603.07917v1
  • Categories: cs.DC
  • Published: March 9, 2026