[Paper] WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
Source: arXiv - 2512.09472v1
Overview
Deploying several large language models (LLMs) on the same GPU cluster can boost overall utilization, but it often inflates the delay between a request's arrival and the first generated token, known as time-to-first-token (TTFT). The new WarmServe system tackles this problem by predictively "pre-warming" GPUs with the right models before they are needed, turning a traditionally reactive scaling approach into a proactive one.
Key Contributions
- One‑for‑many GPU prewarming: Introduces universal GPU workers that can host any LLM and be prepared in advance based on workload forecasts (a minimal prewarming sketch follows this list).
- Evict‑aware placement: A scheduler that decides where to place models so that prewarming does not cause costly evictions across the cluster.
- Zero‑overhead memory switching: A lightweight mechanism that swaps model weights in GPU memory without pausing inference, eliminating the usual “cold‑start” delay.
- Real‑world validation: Experiments on production‑grade traces show up to 50.8× lower TTFT than autoscaling baselines and up to 2.5× higher request throughput than existing GPU‑sharing solutions.
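To make the forecast-driven, one-for-many idea concrete, here is a minimal sketch of how prewarming decisions might be wired together: a naive periodic forecaster averages each model's demand at the same time-of-day slot across previous days, and idle universal workers are prewarmed with the highest-demand models. The `predict_demand` helper, `UniversalWorker` class, and 24-slot period are hypothetical names and simplifications for illustration, not interfaces from the paper.

```python
# Minimal sketch of forecast-driven prewarming (not WarmServe's code).
# predict_demand, UniversalWorker, and prewarm_for_next_slot are
# hypothetical names used only for illustration.
from typing import Dict, List


def predict_demand(history: Dict[str, List[int]], slot: int,
                   period: int = 24) -> Dict[str, float]:
    """Naive periodic forecast: average the request counts each model saw
    at the same time-of-day slot on previous days."""
    forecast = {}
    for model, counts in history.items():
        same_slot = counts[slot % period::period]
        forecast[model] = sum(same_slot) / max(len(same_slot), 1)
    return forecast


class UniversalWorker:
    """A GPU worker that can host any model; 'warm' means its memory is
    pre-allocated and the runtime is already initialized."""

    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id
        self.resident_model = None

    def prewarm(self, model: str) -> None:
        # A real worker would load the weights ahead of the first request.
        self.resident_model = model


def prewarm_for_next_slot(workers: List[UniversalWorker],
                          history: Dict[str, List[int]], slot: int) -> None:
    """Assign the highest-forecast models to idle workers before their
    requests arrive, instead of reacting to a cold start."""
    forecast = predict_demand(history, slot)
    ranked = sorted(forecast, key=forecast.get, reverse=True)
    for worker, model in zip(workers, ranked):
        worker.prewarm(model)
```

For instance, calling `prewarm_for_next_slot(workers, history, slot=18)` would warm workers for whatever models the history says tend to spike in that evening slot.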
Methodology
- Workload Prediction – The authors analyze production logs and confirm that LLM request patterns are highly periodic (e.g., daily peaks). Because demand is predictable, per‑model forecasts can be computed ahead of time and fed into the scheduler to decide what to prewarm and where.
- Universal GPU Workers – Instead of dedicating a GPU to a specific model, each worker runs a lightweight runtime capable of loading any model on demand. The worker stays “warm” (GPU memory allocated, kernels initialized) even when no request is currently using it.
- Evict‑aware Placement – When a new request arrives, WarmServe checks whether loading the required model would evict another model that is likely to be needed soon. If so, it chooses a different GPU or postpones the eviction, balancing memory pressure against future demand (a placement sketch appears at the end of this section).
- Zero‑overhead Switching – Model weights are staged in a pinned CPU‑side buffer. When a switch is required, WarmServe streams the needed weights directly into a pre‑allocated GPU memory region, overlapping the transfer with ongoing inference for other requests (sketched below). This avoids the pause that normally occurs when a model is loaded for the first time.
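A rough approximation of this switching step can be built from standard PyTorch/CUDA primitives: keep the next model's weights in pinned host memory, reserve a GPU region in advance, and issue the copy on a side CUDA stream so it overlaps with compute on the default stream. The snippet below is a minimal sketch under those assumptions; buffer sizes and the stand-in matmul are placeholders, not WarmServe's actual mechanism.

```python
# Minimal sketch of overlapping a weight transfer with ongoing GPU compute,
# in the spirit of WarmServe's zero-overhead switching. Plain PyTorch/CUDA
# primitives; sizes and the stand-in matmul are placeholders.
import torch

assert torch.cuda.is_available(), "this sketch requires a CUDA GPU"
device = torch.device("cuda:0")

# Weights of the *next* model staged in pinned (page-locked) host memory,
# so the DMA engine can stream them without an extra staging copy.
next_weights_cpu = torch.randn(1024 * 1024).half().pin_memory()

# Pre-allocated GPU region that a universal worker keeps reserved for swaps.
swap_region_gpu = torch.empty_like(next_weights_cpu, device=device)

copy_stream = torch.cuda.Stream(device=device)

# Inference for the currently resident model keeps running on the default
# stream; the weight transfer is issued on a side stream so the two overlap.
a = torch.randn(4096, 4096, device=device, dtype=torch.float16)
b = torch.randn(4096, 4096, device=device, dtype=torch.float16)

with torch.cuda.stream(copy_stream):
    swap_region_gpu.copy_(next_weights_cpu, non_blocking=True)

ongoing_output = a @ b  # stands in for serving requests of the resident model

# Before the newly loaded model serves its first request, wait only on the
# copy stream rather than synchronizing the whole device.
torch.cuda.current_stream(device).wait_stream(copy_stream)
```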
The whole pipeline runs as a thin layer on top of existing serving frameworks (e.g., TensorRT‑LLM, vLLM), making it easy to drop into current deployments.
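As a rough illustration of evict-aware placement, the sketch below scores each candidate GPU by the forecast demand of the models it would have to evict to make room for the incoming model, then picks the cheapest GPU or defers if nothing fits. The `GpuState` record, memory bookkeeping, and greedy eviction order are hypothetical simplifications rather than the paper's scheduler.

```python
# Minimal sketch of an evict-aware placement rule (an illustration of the
# idea, not WarmServe's scheduler). GpuState and the memory accounting are
# hypothetical simplifications.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GpuState:
    gpu_id: int
    free_mem_gb: float
    resident_models: List[str] = field(default_factory=list)


def eviction_cost(gpu: GpuState, needed_gb: float,
                  forecast: Dict[str, float],
                  model_mem_gb: Dict[str, float]) -> float:
    """Estimate how much soon-needed work would be destroyed by making room
    on this GPU: sum the forecast demand of the models that must be evicted."""
    if gpu.free_mem_gb >= needed_gb:
        return 0.0
    deficit = needed_gb - gpu.free_mem_gb
    cost = 0.0
    # Evict the least-demanded resident models first until the deficit is covered.
    for model in sorted(gpu.resident_models, key=lambda m: forecast.get(m, 0.0)):
        if deficit <= 0:
            break
        cost += forecast.get(model, 0.0)
        deficit -= model_mem_gb[model]
    return cost if deficit <= 0 else float("inf")  # cannot fit at all


def place(model: str, gpus: List[GpuState],
          forecast: Dict[str, float],
          model_mem_gb: Dict[str, float]) -> Optional[GpuState]:
    """Pick the GPU where prewarming `model` evicts the least future demand."""
    needed = model_mem_gb[model]
    best = min(gpus, key=lambda g: eviction_cost(g, needed, forecast, model_mem_gb))
    if eviction_cost(best, needed, forecast, model_mem_gb) == float("inf"):
        return None  # no GPU fits; defer or postpone the eviction
    return best
```

Ranking candidate GPUs by the forecast demand they would displace, rather than by free memory alone, is what keeps a prewarm from evicting a model the predictor expects to need again shortly.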
Results & Findings
| Metric | WarmServe vs. Autoscaling | WarmServe vs. GPU‑Sharing |
|---|---|---|
| TTFT (median) | Up to 50.8× faster (cold‑start reduced from ~2 s to ~40 ms) | Comparable, but with higher overall throughput |
| Throughput | 1.8× more requests per GPU | 2.5× more requests overall |
| GPU Utilization | 68 % average (vs. 45 % for autoscaling) | 73 % average (vs. 55 % for naive sharing) |
| Memory Overhead | < 5 % extra for universal worker buffers | Negligible |
The authors also show that WarmServe’s proactive prewarming adapts gracefully to workload spikes: when a predicted surge occurs, the system already has the needed models resident, eliminating the “ramp‑up” latency that plagues reactive autoscalers.
Practical Implications
- Lower latency for end‑users: Applications that rely on LLMs for chat, code completion, or real‑time summarization can deliver responses almost instantly, improving user experience and retention.
- Higher ROI on GPU hardware: By squeezing more requests out of the same GPU fleet, cloud providers and enterprises can defer costly hardware upgrades.
- Simplified ops: WarmServe reduces the need for manual tuning of autoscaling thresholds and model placement policies—most decisions are driven by the workload predictor.
- Compatibility with existing stacks: Since WarmServe sits atop popular inference runtimes, teams can adopt it without rewriting model code or retraining models.
- Potential for edge deployment: The universal worker concept could be extended to on‑device GPUs (e.g., NVIDIA Jetson) where memory is scarce but workloads are predictable (e.g., periodic voice‑assistant queries).
Limitations & Future Work
- Prediction dependence: WarmServe’s gains hinge on accurate workload forecasts; sudden, non‑periodic traffic bursts could still cause cold‑starts.
- Memory footprint: Maintaining universal workers incurs a modest memory overhead, which may become significant on very small GPUs.
- Model size constraints: Extremely large models that exceed a single GPU’s memory still require model parallelism, a scenario not fully addressed by the current design.
- Future directions suggested by the authors include tighter integration with reinforcement‑learning‑based schedulers, support for multi‑GPU model sharding, and extension of the approach to other accelerator types (TPUs, ASICs).
Authors
- Chiheng Lou
- Sheng Qi
- Rui Kang
- Yong Zhang
- Chen Sun
- Pengcheng Wang
- Bingyang Liu
- Xuanzhe Liu
- Xin Jin
Paper Information
- arXiv ID: 2512.09472v1
- Categories: cs.DC, cs.LG
- Published: December 10, 2025
- PDF: https://arxiv.org/pdf/2512.09472v1