[Paper] OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
Source: arXiv - 2602.12151v1
Overview
The paper introduces OServe, a serving system for large language models (LLMs) that adapts to the spatial (per‑request differences in size and memory footprint) and temporal (shifts in the request mix over time) heterogeneity of real‑world workloads. By orchestrating heterogeneous model replicas and switching among them on the fly, OServe achieves up to 2× the throughput of existing static serving stacks while keeping tail latency predictable.
Key Contributions
- Workload‑aware scheduler that selects the optimal mix of heterogeneous model deployments (e.g., different quantization levels, sharding strategies) based on the current request distribution.
- Adaptive deployment switching mechanism that migrates or re‑configures model replicas when the predicted workload pattern shifts, without incurring large downtime.
- Comprehensive evaluation on production‑grade traces showing an average 1.5× speed‑up (up to 2×) over state‑of‑the‑art serving frameworks such as vLLM and TGI.
- Open‑source prototype that integrates with popular inference runtimes (TensorRT‑LLM, PyTorch Serve) and can be dropped into existing inference pipelines.
Methodology
- Characterizing Heterogeneity – The authors first profile a suite of LLM deployments (full‑precision, 8‑bit, 4‑bit, tensor‑parallel vs pipeline‑parallel) to build a resource‑performance lookup table (GPU memory ↔ latency ↔ throughput).
- Real‑time Workload Monitoring – A lightweight collector aggregates per‑second statistics: request length, token count, and memory pressure.
- Scheduling Algorithm – Using the lookup table and live metrics, a mixed‑integer linear program (solved with a fast heuristic) decides how many replicas of each deployment type to run on each GPU node. The objective balances throughput maximization and latency SLA compliance.
- Predictive Switching – A short‑term time‑series model (ARIMA‑like) forecasts workload changes. When a forecasted shift exceeds a confidence threshold, OServe triggers a deployment migration: it spins up the new mix in the background, warms it with a small batch of requests, then gracefully drains the old replicas.
- Evaluation Setup – Experiments use real request traces from a cloud‑based chatbot service (≈10 k requests/hour, mix of short prompts and long completions) on a 4‑node GPU cluster (8 × A100 per node). Baselines include static homogeneous deployments and the popular vLLM scheduler.
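The profiling step above yields a resource‑performance lookup table mapping each pre‑compiled deployment variant to its GPU memory, latency, and throughput. A minimal sketch of such a table in Python; the variant names and all numbers here are illustrative placeholders, not values from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentProfile:
    """Profiled characteristics of one pre-compiled deployment variant."""
    name: str               # variant label, e.g. "int4-tp1" (illustrative)
    gpu_mem_gb: float       # memory footprint per replica
    latency_ms: float       # mean per-request latency under load
    throughput_rps: float   # sustained requests/second per replica

# Illustrative entries; real values would come from offline profiling runs.
PROFILE_TABLE = {
    "fp16-tp4": DeploymentProfile("fp16-tp4", 64.0, 180.0, 3.0),
    "int8-tp2": DeploymentProfile("int8-tp2", 34.0, 130.0, 5.5),
    "int4-tp1": DeploymentProfile("int4-tp1", 18.0, 95.0, 9.0),
}
```

Keeping the table immutable (`frozen=True`) lets the scheduler treat profiles as cache keys without defensive copying.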
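The monitoring step above aggregates per‑second request statistics. One way to sketch such a lightweight collector is a sliding‑window aggregator; the window size and the exact statistics kept are assumptions, not details from the paper:

```python
import time
from collections import deque

class WorkloadMonitor:
    """Aggregates recent request statistics over a sliding time window."""

    def __init__(self, window_s: int = 60):
        self.window_s = window_s
        # Each sample: (timestamp, prompt_tokens, output_tokens)
        self.samples = deque()

    def record(self, prompt_tokens: int, output_tokens: int, now=None):
        """Record one completed request."""
        now = time.time() if now is None else now
        self.samples.append((now, prompt_tokens, output_tokens))
        self._evict(now)

    def _evict(self, now):
        # Drop samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def snapshot(self, now=None) -> dict:
        """Return current aggregate metrics for the scheduler."""
        now = time.time() if now is None else now
        self._evict(now)
        n = len(self.samples)
        if n == 0:
            return {"rps": 0.0, "avg_prompt": 0.0, "avg_output": 0.0}
        return {
            "rps": n / self.window_s,
            "avg_prompt": sum(s[1] for s in self.samples) / n,
            "avg_output": sum(s[2] for s in self.samples) / n,
        }
```

Passing explicit timestamps (the `now` parameter) keeps the collector deterministic for testing while defaulting to wall‑clock time in production.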
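The scheduling step above solves a mixed‑integer program with a fast heuristic. A plausible stand‑in for that heuristic is a greedy packer that, among variants meeting the latency SLA, prefers the best throughput per GB until demand or memory runs out; this is a sketch of the idea, not the paper's actual algorithm:

```python
from collections import namedtuple

Profile = namedtuple("Profile", "name gpu_mem_gb latency_ms throughput_rps")

def plan_replicas(profiles, demand_rps, node_mem_gb, sla_ms):
    """Greedy stand-in for the paper's MILP: choose replica counts that
    cover the demanded throughput within memory and latency budgets."""
    # Only variants that meet the latency SLA are eligible.
    eligible = [p for p in profiles if p.latency_ms <= sla_ms]
    # Most throughput per GB of GPU memory first.
    eligible.sort(key=lambda p: p.throughput_rps / p.gpu_mem_gb, reverse=True)

    plan, mem_left, need = {}, node_mem_gb, demand_rps
    for p in eligible:
        while need > 0 and mem_left >= p.gpu_mem_gb:
            plan[p.name] = plan.get(p.name, 0) + 1
            mem_left -= p.gpu_mem_gb
            need -= p.throughput_rps
    return plan
```

A real MILP would also model inter‑node placement and migration cost; the greedy version merely shows how the lookup table and live demand feed the decision.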
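The predictive‑switching step above triggers a migration when a forecasted shift exceeds a confidence threshold. As a simplified stand‑in for the paper's ARIMA‑like model, the sketch below uses exponential smoothing and fires when the observed load drifts past a relative threshold; the smoothing factor and threshold are assumed values:

```python
class SwitchPredictor:
    """Flags a deployment switch when observed load drifts past a
    relative threshold versus the smoothed forecast (exponential
    smoothing here stands in for the paper's ARIMA-like model)."""

    def __init__(self, alpha: float = 0.3, threshold: float = 0.25):
        self.alpha = alpha          # smoothing factor for the level
        self.threshold = threshold  # relative drift that triggers a switch
        self.level = None           # smoothed load estimate

    def update(self, observed_rps: float) -> bool:
        """Feed one observation; return True if a migration should start."""
        if self.level is None:
            self.level = observed_rps
            return False
        forecast = self.level  # flat forecast from the smoothed level
        self.level = self.alpha * observed_rps + (1 - self.alpha) * self.level
        drift = abs(observed_rps - forecast) / max(forecast, 1e-9)
        return drift > self.threshold
```

On a `True` return, OServe's migration path would spin up the new replica mix in the background, warm it, and drain the old replicas, so the trigger itself never blocks serving.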
Results & Findings
| Metric | OServe | vLLM (static) | TGI (static) |
|---|---|---|---|
| Throughput (relative) | 2.0× peak, 1.5× avg | 1.0× (baseline) | 0.9× |
| 99th‑pct latency | 120 ms (SLA met) | 210 ms | 230 ms |
| GPU memory utilization | 78 % (balanced) | 92 % (over‑commit) | 85 % |
| Switching overhead | < 5 % of request volume | N/A | N/A |
- The scheduler consistently picks a blend of low‑precision (4‑bit) replicas for high‑throughput short prompts and full‑precision shards for long‑context generations.
- Adaptive switching reduces “cold‑start” penalties: after a workload shift, OServe reaches the new optimal configuration within ~30 seconds, whereas static baselines suffer sustained latency spikes.
- Energy consumption drops ~12 % because the system can retire high‑memory replicas when they’re not needed.
Practical Implications
- Cost Savings for Cloud Providers – By packing more requests onto the same GPU fleet, operators can defer hardware upgrades or reduce spot‑instance spend.
- SLA‑Driven SaaS Products – Chatbot and code‑assistant services can guarantee tighter latency bounds even during traffic bursts (e.g., product launches).
- Developer Flexibility – Teams can expose a single inference endpoint while OServe silently swaps between quantization levels or sharding strategies, eliminating the need to maintain multiple deployment pipelines.
- Edge & On‑Device Scenarios – The same principles can be applied to heterogeneous edge accelerators (CPU, NPU, GPU) where memory constraints vary dramatically across devices.
Limitations & Future Work
- Model Granularity – OServe currently assumes a fixed set of pre‑compiled deployment variants; extending to arbitrary on‑the‑fly quantization would increase flexibility.
- Prediction Accuracy – The workload‑forecasting component works well on diurnal patterns but may lag under abrupt spikes (e.g., flash‑crowd events). More robust online learning models are a promising direction.
- Multi‑Tenant Isolation – The paper focuses on a single tenant’s workload; handling security and fairness across multiple customers would require additional scheduling constraints.
- Hardware Diversity – Experiments are limited to homogeneous A100 clusters; evaluating on mixed‑generation GPUs or emerging accelerators (e.g., Habana, AWS Trainium) is left for future studies.
Authors
- Youhe Jiang
- Fangcheng Fu
- Taiyi Wang
- Guoliang He
- Eiko Yoneki
Paper Information
- arXiv ID: 2602.12151v1
- Categories: cs.DC
- Published: February 12, 2026