[Paper] WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
Source: arXiv - 2512.09472v1
Overview
Deploying several large language models (LLMs) on the same GPU cluster can boost overall utilization, but it often inflates the delay between a request's arrival and the first generated token, known as time-to-first-token (TTFT). The new WarmServe system tackles this problem by predictively "pre-warming" GPUs with the right models before they are needed, turning a traditionally reactive scaling approach into a proactive one.
Key Contributions
- One‑for‑many GPU prewarming: Introduces universal GPU workers that can host any LLM and be prepared in advance based on workload forecasts (a minimal prewarming sketch follows this list).
- Evict‑aware placement: A scheduler that decides where to place models so that prewarming does not cause costly evictions across the cluster.
- Zero‑overhead memory switching: A lightweight mechanism that swaps model weights in GPU memory without pausing inference, eliminating the usual “cold‑start” delay.
- Real‑world validation: Experiments on production‑grade traces show up to 50.8× lower TTFT than autoscaling baselines and up to 2.5× higher request throughput than existing GPU‑sharing solutions.
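To make the forecast-driven, one-for-many idea concrete, here is a minimal sketch of how prewarming decisions might be wired together: a naive periodic forecaster averages each model's demand at the same time-of-day slot across previous days, and idle universal workers are prewarmed with the highest-demand models. The `predict_demand` helper, `UniversalWorker` class, and 24-slot period are hypothetical names and simplifications for illustration, not interfaces from the paper.

```python
# Minimal sketch of forecast-driven prewarming (not WarmServe's code).
# predict_demand, UniversalWorker, and prewarm_for_next_slot are
# hypothetical names used only for illustration.
from typing import Dict, List


def predict_demand(history: Dict[str, List[int]], slot: int,
                   period: int = 24) -> Dict[str, float]:
    """Naive periodic forecast: average the request counts each model saw
    at the same time-of-day slot on previous days."""
    forecast = {}
    for model, counts in history.items():
        same_slot = counts[slot % period::period]
        forecast[model] = sum(same_slot) / max(len(same_slot), 1)
    return forecast


class UniversalWorker:
    """A GPU worker that can host any model; 'warm' means its memory is
    pre-allocated and the runtime is already initialized."""

    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id
        self.resident_model = None

    def prewarm(self, model: str) -> None:
        # A real worker would load the weights ahead of the first request.
        self.resident_model = model


def prewarm_for_next_slot(workers: List[UniversalWorker],
                          history: Dict[str, List[int]], slot: int) -> None:
    """Assign the highest-forecast models to idle workers before their
    requests arrive, instead of reacting to a cold start."""
    forecast = predict_demand(history, slot)
    ranked = sorted(forecast, key=forecast.get, reverse=True)
    for worker, model in zip(workers, ranked):
        worker.prewarm(model)
```

For instance, calling `prewarm_for_next_slot(workers, history, slot=18)` would warm workers for whatever models the history says tend to spike in that evening slot.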
Methodology
- Workload Prediction – The authors analyze production logs and confirm that LLM request patterns are highly periodic (e.g., daily peaks). Because demand is predictable, per‑model forecasts can be computed ahead of time and fed into the scheduler to decide what to prewarm and where.
- Universal GPU Workers – Instead of dedicating a GPU to a specific model, each worker runs a lightweight runtime capable of loading any model on demand. The worker stays “warm” (GPU memory allocated, kernels initialized) even when no request is currently using it.
- Evict‑aware Placement – When a new request arrives, WarmServe checks whether loading the required model would evict another model that is likely to be needed soon. If so, it chooses a different GPU or postpones the eviction, balancing memory pressure against future demand (a placement sketch appears at the end of this section).
- Zero‑overhead Switching – Model weights are staged in a pinned CPU‑side buffer. When a switch is required, WarmServe streams the needed weights directly into a pre‑allocated GPU memory region, overlapping the transfer with ongoing inference for other requests (sketched below). This avoids the pause that normally occurs when a model is loaded for the first time.
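A rough approximation of this switching step can be built from standard PyTorch/CUDA primitives: keep the next model's weights in pinned host memory, reserve a GPU region in advance, and issue the copy on a side CUDA stream so it overlaps with compute on the default stream. The snippet below is a minimal sketch under those assumptions; buffer sizes and the stand-in matmul are placeholders, not WarmServe's actual mechanism.

```python
# Minimal sketch of overlapping a weight transfer with ongoing GPU compute,
# in the spirit of WarmServe's zero-overhead switching. Plain PyTorch/CUDA
# primitives; sizes and the stand-in matmul are placeholders.
import torch

assert torch.cuda.is_available(), "this sketch requires a CUDA GPU"
device = torch.device("cuda:0")

# Weights of the *next* model staged in pinned (page-locked) host memory,
# so the DMA engine can stream them without an extra staging copy.
next_weights_cpu = torch.randn(1024 * 1024).half().pin_memory()

# Pre-allocated GPU region that a universal worker keeps reserved for swaps.
swap_region_gpu = torch.empty_like(next_weights_cpu, device=device)

copy_stream = torch.cuda.Stream(device=device)

# Inference for the currently resident model keeps running on the default
# stream; the weight transfer is issued on a side stream so the two overlap.
a = torch.randn(4096, 4096, device=device, dtype=torch.float16)
b = torch.randn(4096, 4096, device=device, dtype=torch.float16)

with torch.cuda.stream(copy_stream):
    swap_region_gpu.copy_(next_weights_cpu, non_blocking=True)

ongoing_output = a @ b  # stands in for serving requests of the resident model

# Before the newly loaded model serves its first request, wait only on the
# copy stream rather than synchronizing the whole device.
torch.cuda.current_stream(device).wait_stream(copy_stream)
```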
The whole pipeline runs as a thin layer on top of existing serving frameworks (e.g., TensorRT‑LLM, vLLM), making it easy to drop into current deployments.
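As a rough illustration of evict-aware placement, the sketch below scores each candidate GPU by the forecast demand of the models it would have to evict to make room for the incoming model, then picks the cheapest GPU or defers if nothing fits. The `GpuState` record, memory bookkeeping, and greedy eviction order are hypothetical simplifications rather than the paper's scheduler.

```python
# Minimal sketch of an evict-aware placement rule (an illustration of the
# idea, not WarmServe's scheduler). GpuState and the memory accounting are
# hypothetical simplifications.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GpuState:
    gpu_id: int
    free_mem_gb: float
    resident_models: List[str] = field(default_factory=list)


def eviction_cost(gpu: GpuState, needed_gb: float,
                  forecast: Dict[str, float],
                  model_mem_gb: Dict[str, float]) -> float:
    """Estimate how much soon-needed work would be destroyed by making room
    on this GPU: sum the forecast demand of the models that must be evicted."""
    if gpu.free_mem_gb >= needed_gb:
        return 0.0
    deficit = needed_gb - gpu.free_mem_gb
    cost = 0.0
    # Evict the least-demanded resident models first until the deficit is covered.
    for model in sorted(gpu.resident_models, key=lambda m: forecast.get(m, 0.0)):
        if deficit <= 0:
            break
        cost += forecast.get(model, 0.0)
        deficit -= model_mem_gb[model]
    return cost if deficit <= 0 else float("inf")  # cannot fit at all


def place(model: str, gpus: List[GpuState],
          forecast: Dict[str, float],
          model_mem_gb: Dict[str, float]) -> Optional[GpuState]:
    """Pick the GPU where prewarming `model` evicts the least future demand."""
    needed = model_mem_gb[model]
    best = min(gpus, key=lambda g: eviction_cost(g, needed, forecast, model_mem_gb))
    if eviction_cost(best, needed, forecast, model_mem_gb) == float("inf"):
        return None  # no GPU fits; defer or postpone the eviction
    return best
```

Ranking candidate GPUs by the forecast demand they would displace, rather than by free memory alone, is what keeps a prewarm from evicting a model the predictor expects to need again shortly.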
Results & Findings
| Metric | WarmServe vs. Autoscaling | WarmServe vs. GPU‑Sharing |
|---|---|---|
| TTFT (median) | Up to 50.8× faster (cold‑start reduced from ~2 s to ~40 ms) | Comparable, but with higher overall throughput |
| Throughput | 1.8× more requests per GPU | 2.5× more requests overall |
| GPU Utilization | 68 % average (vs. 45 % for autoscaling) | 73 % average (vs. 55 % for naive sharing) |
| Memory Overhead | < 5 % extra for universal worker buffers | Negligible |
The authors also show that WarmServe’s proactive prewarming adapts gracefully to workload spikes: when a predicted surge occurs, the system already has the needed models resident, eliminating the “ramp‑up” latency that plagues reactive autoscalers.
Practical Implications
- Lower latency for end‑users: Applications that rely on LLMs for chat, code completion, or real‑time summarization can deliver responses almost instantly, improving user experience and retention.
- Higher ROI on GPU hardware: By squeezing more requests out of the same GPU fleet, cloud providers and enterprises can defer costly hardware upgrades.
- Simplified ops: WarmServe reduces the need for manual tuning of autoscaling thresholds and model placement policies—most decisions are driven by the workload predictor.
- Compatibility with existing stacks: Since WarmServe sits atop popular inference runtimes, teams can adopt it without rewriting model code or retraining models.
- Potential for edge deployment: The universal worker concept could be extended to on‑device GPUs (e.g., NVIDIA Jetson) where memory is scarce but workloads are predictable (e.g., periodic voice‑assistant queries).
Limitations & Future Work
- Prediction dependence: WarmServe’s gains hinge on accurate workload forecasts; sudden, non‑periodic traffic bursts could still cause cold‑starts.
- Memory footprint: Maintaining universal workers incurs a modest memory overhead, which may become significant on very small GPUs.
- Model size constraints: Extremely large models that exceed a single GPU’s memory still require model parallelism, a scenario not fully addressed by the current design.
- Future directions suggested by the authors include tighter integration with reinforcement‑learning‑based schedulers, support for multi‑GPU model sharding, and extension of the approach to other accelerator types (TPUs, ASICs).
Authors
- Chiheng Lou
- Sheng Qi
- Rui Kang
- Yong Zhang
- Chen Sun
- Pengcheng Wang
- Bingyang Liu
- Xuanzhe Liu
- Xin Jin
Paper Information
- arXiv ID: 2512.09472v1
- Categories: cs.DC, cs.LG
- Published: December 10, 2025
- PDF: https://arxiv.org/pdf/2512.09472v1