[Paper] OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
Source: arXiv - 2602.12151v1
Overview
The paper introduces OServe, a serving system for large language models (LLMs) that adapts to the spatial (per‑request differences in size and memory footprint) and temporal (shifts in the request mix over time) heterogeneity of real‑world workloads. By orchestrating heterogeneous model replicas and switching among them on the fly, OServe achieves up to 2× the throughput of existing static serving stacks while keeping tail latency predictable.
Key Contributions
- Workload‑aware scheduler that selects the optimal mix of heterogeneous model deployments (e.g., different quantization levels, sharding strategies) based on the current request distribution.
- Adaptive deployment switching mechanism that migrates or re‑configures model replicas when the predicted workload pattern shifts, without incurring large downtime.
- Comprehensive evaluation on production‑grade traces showing an average 1.5× speed‑up (up to 2×) over state‑of‑the‑art serving frameworks such as vLLM and TGI.
- Open‑source prototype that integrates with popular inference runtimes (TensorRT‑LLM, PyTorch Serve) and can be dropped into existing inference pipelines.
Methodology
- Characterizing Heterogeneity – The authors first profile a suite of LLM deployments (full‑precision, 8‑bit, 4‑bit, tensor‑parallel vs pipeline‑parallel) to build a resource‑performance lookup table (GPU memory ↔ latency ↔ throughput).
- Real‑time Workload Monitoring – A lightweight collector aggregates per‑second statistics: request length, token count, and memory pressure.
- Scheduling Algorithm – Using the lookup table and live metrics, a mixed‑integer linear program (solved with a fast heuristic) decides how many replicas of each deployment type to run on each GPU node. The objective balances throughput maximization and latency SLA compliance.
- Predictive Switching – A short‑term time‑series model (ARIMA‑like) forecasts workload changes. When a forecasted shift exceeds a confidence threshold, OServe triggers a deployment migration: it spins up the new mix in the background, warms it with a small batch of requests, then gracefully drains the old replicas.
- Evaluation Setup – Experiments use real request traces from a cloud‑based chatbot service (≈10 k requests/hour, mix of short prompts and long completions) on a 4‑node GPU cluster (8 × A100 per node). Baselines include static homogeneous deployments and the popular vLLM scheduler.
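The profiling step above yields a resource‑performance lookup table mapping each pre‑compiled deployment variant to its GPU memory, latency, and throughput. A minimal sketch of such a table in Python; the variant names and all numbers here are illustrative placeholders, not values from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentProfile:
    """Profiled characteristics of one pre-compiled deployment variant."""
    name: str               # variant label, e.g. "int4-tp1" (illustrative)
    gpu_mem_gb: float       # memory footprint per replica
    latency_ms: float       # mean per-request latency under load
    throughput_rps: float   # sustained requests/second per replica

# Illustrative entries; real values would come from offline profiling runs.
PROFILE_TABLE = {
    "fp16-tp4": DeploymentProfile("fp16-tp4", 64.0, 180.0, 3.0),
    "int8-tp2": DeploymentProfile("int8-tp2", 34.0, 130.0, 5.5),
    "int4-tp1": DeploymentProfile("int4-tp1", 18.0, 95.0, 9.0),
}
```

Keeping the table immutable (`frozen=True`) lets the scheduler treat profiles as cache keys without defensive copying.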
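The monitoring step above aggregates per‑second request statistics. One way to sketch such a lightweight collector is a sliding‑window aggregator; the window size and the exact statistics kept are assumptions, not details from the paper:

```python
import time
from collections import deque

class WorkloadMonitor:
    """Aggregates recent request statistics over a sliding time window."""

    def __init__(self, window_s: int = 60):
        self.window_s = window_s
        # Each sample: (timestamp, prompt_tokens, output_tokens)
        self.samples = deque()

    def record(self, prompt_tokens: int, output_tokens: int, now=None):
        """Record one completed request."""
        now = time.time() if now is None else now
        self.samples.append((now, prompt_tokens, output_tokens))
        self._evict(now)

    def _evict(self, now):
        # Drop samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def snapshot(self, now=None) -> dict:
        """Return current aggregate metrics for the scheduler."""
        now = time.time() if now is None else now
        self._evict(now)
        n = len(self.samples)
        if n == 0:
            return {"rps": 0.0, "avg_prompt": 0.0, "avg_output": 0.0}
        return {
            "rps": n / self.window_s,
            "avg_prompt": sum(s[1] for s in self.samples) / n,
            "avg_output": sum(s[2] for s in self.samples) / n,
        }
```

Passing explicit timestamps (the `now` parameter) keeps the collector deterministic for testing while defaulting to wall‑clock time in production.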
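The scheduling step above solves a mixed‑integer program with a fast heuristic. A plausible stand‑in for that heuristic is a greedy packer that, among variants meeting the latency SLA, prefers the best throughput per GB until demand or memory runs out; this is a sketch of the idea, not the paper's actual algorithm:

```python
from collections import namedtuple

Profile = namedtuple("Profile", "name gpu_mem_gb latency_ms throughput_rps")

def plan_replicas(profiles, demand_rps, node_mem_gb, sla_ms):
    """Greedy stand-in for the paper's MILP: choose replica counts that
    cover the demanded throughput within memory and latency budgets."""
    # Only variants that meet the latency SLA are eligible.
    eligible = [p for p in profiles if p.latency_ms <= sla_ms]
    # Most throughput per GB of GPU memory first.
    eligible.sort(key=lambda p: p.throughput_rps / p.gpu_mem_gb, reverse=True)

    plan, mem_left, need = {}, node_mem_gb, demand_rps
    for p in eligible:
        while need > 0 and mem_left >= p.gpu_mem_gb:
            plan[p.name] = plan.get(p.name, 0) + 1
            mem_left -= p.gpu_mem_gb
            need -= p.throughput_rps
    return plan
```

A real MILP would also model inter‑node placement and migration cost; the greedy version merely shows how the lookup table and live demand feed the decision.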
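The predictive‑switching step above triggers a migration when a forecasted shift exceeds a confidence threshold. As a simplified stand‑in for the paper's ARIMA‑like model, the sketch below uses exponential smoothing and fires when the observed load drifts past a relative threshold; the smoothing factor and threshold are assumed values:

```python
class SwitchPredictor:
    """Flags a deployment switch when observed load drifts past a
    relative threshold versus the smoothed forecast (exponential
    smoothing here stands in for the paper's ARIMA-like model)."""

    def __init__(self, alpha: float = 0.3, threshold: float = 0.25):
        self.alpha = alpha          # smoothing factor for the level
        self.threshold = threshold  # relative drift that triggers a switch
        self.level = None           # smoothed load estimate

    def update(self, observed_rps: float) -> bool:
        """Feed one observation; return True if a migration should start."""
        if self.level is None:
            self.level = observed_rps
            return False
        forecast = self.level  # flat forecast from the smoothed level
        self.level = self.alpha * observed_rps + (1 - self.alpha) * self.level
        drift = abs(observed_rps - forecast) / max(forecast, 1e-9)
        return drift > self.threshold
```

On a `True` return, OServe's migration path would spin up the new replica mix in the background, warm it, and drain the old replicas, so the trigger itself never blocks serving.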
Results & Findings
| Metric | OServe | vLLM (static) | TGI (static) |
|---|---|---|---|
| Throughput (relative) | 2.0× peak, 1.5× avg | 1.0× (baseline) | 0.9× |
| 99th‑pct latency | 120 ms (SLA met) | 210 ms | 230 ms |
| GPU memory utilization | 78 % (balanced) | 92 % (over‑commit) | 85 % |
| Switching overhead | < 5 % of request volume | N/A | N/A |
- The scheduler consistently picks a blend of low‑precision (4‑bit) replicas for high‑throughput short prompts and full‑precision shards for long‑context generations.
- Adaptive switching reduces “cold‑start” penalties: after a workload shift, OServe reaches the new optimal configuration within ~30 seconds, whereas static baselines suffer sustained latency spikes.
- Energy consumption drops ~12 % because the system can retire high‑memory replicas when they’re not needed.
Practical Implications
- Cost Savings for Cloud Providers – By packing more requests onto the same GPU fleet, operators can defer hardware upgrades or reduce spot‑instance spend.
- SLA‑Driven SaaS Products – Chatbot and code‑assistant services can guarantee tighter latency bounds even during traffic bursts (e.g., product launches).
- Developer Flexibility – Teams can expose a single inference endpoint while OServe silently swaps between quantization levels or sharding strategies, eliminating the need to maintain multiple deployment pipelines.
- Edge & On‑Device Scenarios – The same principles can be applied to heterogeneous edge accelerators (CPU, NPU, GPU) where memory constraints vary dramatically across devices.
Limitations & Future Work
- Model Granularity – OServe currently assumes a fixed set of pre‑compiled deployment variants; extending to arbitrary on‑the‑fly quantization would increase flexibility.
- Prediction Accuracy – The workload‑forecasting component works well on diurnal patterns but may lag under abrupt spikes (e.g., flash‑crowd events). More robust online learning models are a promising direction.
- Multi‑Tenant Isolation – The paper focuses on a single tenant’s workload; handling security and fairness across multiple customers would require additional scheduling constraints.
- Hardware Diversity – Experiments are limited to homogeneous A100 clusters; evaluating on mixed‑generation GPUs or emerging accelerators (e.g., Habana, AWS Trainium) is left for future studies.
Authors
- Youhe Jiang
- Fangcheng Fu
- Taiyi Wang
- Guoliang He
- Eiko Yoneki
Paper Information
- arXiv ID: 2602.12151v1
- Categories: cs.DC
- Published: February 12, 2026