[Paper] Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs
Source: arXiv - 2512.20210v1
Overview
Predictive‑LoRA (P‑LoRA) tackles two pain points that developers hit when serving many fine‑tuned Large Language Models (LLMs) in a serverless environment: the “cold‑start” delay caused by loading adapters on demand, and the GPU memory fragmentation that arises when adapters of different sizes are swapped in and out. By forecasting which adapters will be needed next and managing GPU memory with a page‑based scheme, P‑LoRA cuts latency and boosts throughput, making serverless LLM inference more practical for production workloads.
Key Contributions
- Traffic‑aware prefetching: An ultra‑lightweight LSTM predictor forecasts adapter demand from incoming request streams and proactively moves hot adapters from host RAM to GPU memory, slashing cold‑start latency by up to 68 %.
- Fragmentation‑aware memory manager: A page‑based allocation strategy (inspired by OS virtual memory) packs adapters of heterogeneous rank efficiently, keeping GPU utilization > 87 % even under mixed‑size workloads.
- System‑level integration: P‑LoRA is built as a drop‑in replacement for existing serverless inference runtimes (e.g., Azure Functions), requiring only minimal code changes.
- Comprehensive evaluation: Using Azure Functions traces, the authors show 1.52× higher throughput and 35 % lower average Time‑to‑First‑Token (TTFT) compared with the prior S‑LoRA baseline under high concurrency.
Methodology
- Workload characterization – The authors first analyzed real‑world serverless function logs to understand request arrival patterns, adapter popularity distribution, and concurrency spikes.
- Demand prediction – A single‑layer LSTM model (≈ 10 KB) is trained online on recent request timestamps and adapter IDs. The model outputs a short‑term probability map of which adapters will be needed in the next few seconds (a predictor sketch follows this list).
- Proactive prefetching – When the predictor flags an adapter as “hot,” a background thread copies the adapter’s low‑rank weight matrices from host memory to a pre‑allocated GPU page pool, overlapping I/O with ongoing inference (see the prefetch sketch below).
- Page‑based memory management – GPU memory is divided into fixed‑size pages (e.g., 4 MiB). Each adapter is stored as a set of pages; a simple first‑fit allocator with a compaction pass that merges free pages prevents the “holes” that normally arise when adapters of different sizes are loaded and unloaded (see the allocator sketch below).
- Evaluation harness – The system is benchmarked against S‑LoRA using a trace‑driven simulator that reproduces Azure Functions request inter‑arrival times, concurrency levels, and adapter mix. Metrics include TTFT, overall throughput (requests/s), and GPU memory utilization (a trace‑replay sketch appears below).
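The paper describes the predictor only at this level of detail; below is a minimal sketch, in PyTorch, of a single‑layer LSTM that maps a short window of per‑adapter request counts to next‑window demand scores. The class name `AdapterDemandLSTM`, the bucket and window sizes, and the count‑based input features are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch of a tiny single-layer LSTM demand predictor (names, window size,
# and count-based features are assumptions, not the paper's code).
import torch
import torch.nn as nn

class AdapterDemandLSTM(nn.Module):
    """Recent per-adapter request counts -> next-window demand scores."""
    def __init__(self, num_adapters: int, hidden: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_adapters, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_adapters)

    def forward(self, counts: torch.Tensor) -> torch.Tensor:
        # counts: (batch, window, num_adapters), request counts per time bucket.
        _, (h_n, _) = self.lstm(counts)
        # Per-adapter score that it will be requested in the next window.
        return torch.sigmoid(self.head(h_n[-1]))

# Usage: score the last ten 1-second buckets and pick the top-3 "hot" adapters to prefetch.
model = AdapterDemandLSTM(num_adapters=64)
recent_counts = torch.randint(0, 5, (1, 10, 64)).float()  # stand-in for real traffic
hot_ids = torch.topk(model(recent_counts)[0], k=3).indices
```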
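For the prefetching step, the sketch below shows one common way to overlap host‑to‑GPU copies with ongoing inference: pinned host tensors, a dedicated CUDA stream, and a background thread. The `host_adapters`/`gpu_adapters` stores and the function names are assumptions; the paper’s actual integration with its page pool is not shown.

```python
# Sketch of overlapping adapter prefetch with ongoing inference (illustrative names).
import threading
import torch

copy_stream = torch.cuda.Stream()   # side stream so copies overlap default-stream compute
host_adapters = {}                  # adapter_id -> {"A": tensor, "B": tensor} in pinned RAM
gpu_adapters = {}                   # adapter_id -> same tensors resident on the GPU

# Example: register one rank-16 adapter's low-rank factors in pinned host memory.
host_adapters["adapter-0"] = {
    "A": torch.randn(4096, 16).pin_memory(),
    "B": torch.randn(16, 4096).pin_memory(),
}

def prefetch_adapter(adapter_id: str) -> None:
    """Asynchronously copy one adapter's low-rank matrices into GPU memory."""
    with torch.cuda.stream(copy_stream):
        gpu_adapters[adapter_id] = {
            name: w.to("cuda", non_blocking=True)
            for name, w in host_adapters[adapter_id].items()
        }
    # A real system would record a CUDA event here and make the inference stream
    # wait on it before touching the adapter.

def on_prediction(hot_ids) -> None:
    # Called when the predictor flags adapters as "hot"; kept off the serving path.
    for aid in hot_ids:
        if aid not in gpu_adapters:
            threading.Thread(target=prefetch_adapter, args=(aid,), daemon=True).start()
```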
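The page‑based allocator can be sketched as a pool of fixed‑size pages with a lowest‑index (first‑fit) policy: because each adapter is chunked into pages, it need not occupy contiguous memory, which is what keeps external fragmentation low. `PagePool` and its API are illustrative assumptions, and the paper’s compaction pass is omitted here.

```python
# Sketch of a fixed-size page pool in the spirit of the paper's memory manager
# (PagePool and its API are illustrative assumptions; compaction is omitted).
PAGE_BYTES = 4 * 1024 * 1024  # 4 MiB pages, matching the granularity mentioned above

class PagePool:
    def __init__(self, total_bytes: int):
        self.num_pages = total_bytes // PAGE_BYTES
        self.free = set(range(self.num_pages))   # indices of free pages
        self.owner = {}                           # adapter_id -> list of page indices

    def pages_needed(self, adapter_bytes: int) -> int:
        return -(-adapter_bytes // PAGE_BYTES)    # ceiling division

    def allocate(self, adapter_id: str, adapter_bytes: int) -> list[int]:
        # Take the lowest-indexed free pages (first-fit over the page table);
        # pages need not be contiguous, so mixed adapter ranks leave no holes.
        need = self.pages_needed(adapter_bytes)
        if need > len(self.free):
            raise MemoryError("not enough free pages; evict a cold adapter first")
        pages = sorted(self.free)[:need]
        self.free -= set(pages)
        self.owner[adapter_id] = pages
        return pages

    def release(self, adapter_id: str) -> None:
        # Freed pages return to the pool; waste is bounded to under one page per adapter.
        self.free |= set(self.owner.pop(adapter_id))

# Usage: a small and a large adapter share the pool without fragmenting it.
pool = PagePool(total_bytes=2 * 1024**3)
pool.allocate("adapter-r16", 24 * 1024 * 1024)
pool.allocate("adapter-r64", 96 * 1024 * 1024)
pool.release("adapter-r16")
```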
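Finally, a trace‑driven harness of the kind described can be approximated by replaying timestamped requests against a serving callback and recording per‑request TTFT. The trace tuple format and the `serve_request` callback are assumptions, not the authors’ simulator.

```python
# Minimal trace-replay sketch: issue requests at their recorded arrival times and
# collect per-request TTFT (trace format and serve_request are assumptions).
import time
from concurrent.futures import ThreadPoolExecutor

def _timed_call(serve_request, adapter_id, prompt):
    t0 = time.monotonic()
    serve_request(adapter_id, prompt)      # assumed to return once the first token is out
    return time.monotonic() - t0           # per-request TTFT in seconds

def replay(trace, serve_request, max_concurrency=500):
    """trace: list of (arrival_offset_s, adapter_id, prompt), sorted by arrival time."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = []
        for offset, adapter_id, prompt in trace:
            delay = offset - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)          # honor the trace's inter-arrival gaps
            futures.append(pool.submit(_timed_call, serve_request, adapter_id, prompt))
        ttfts = [f.result() for f in futures]
    return sum(ttfts) / len(ttfts)         # average TTFT over the trace
```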
Results & Findings
| Metric | P‑LoRA | S‑LoRA (baseline) | Improvement |
|---|---|---|---|
| Avg. TTFT | 210 ms | 322 ms | 35 % reduction |
| Peak throughput (req/s) | 1,820 | 1,200 | 1.52× |
| GPU memory utilization | 88 % | 71 % | +17 pts |
| Cold‑start latency (worst‑case) | 480 ms | 1,520 ms | 68 % cut |
- The LSTM predictor achieved > 90 % accuracy in identifying the top‑3 adapters that would dominate the next 5‑second window.
- Memory fragmentation dropped from an average of 22 % (S‑LoRA) to < 5 % with the page allocator, directly translating into higher concurrent model capacity.
- Under bursty traffic (up to 500 concurrent invocations), P‑LoRA maintained stable latency, whereas S‑LoRA suffered sharp TTFT spikes due to repeated adapter swaps.
Practical Implications
- Faster user experiences: Developers can ship LLM‑powered APIs (e.g., chat assistants, code completion) with noticeably lower first‑token latency, which is critical for interactive applications.
- Cost efficiency: Higher GPU utilization means fewer GPUs are needed to serve the same request volume, lowering cloud spend for pay‑per‑use serverless platforms.
- Simplified ops: The proactive prefetching removes the need for manual “warm‑up” scripts or over‑provisioning of adapters, letting teams rely on the system to keep hot adapters resident.
- Scalable multi‑tenant services: SaaS providers can host dozens of fine‑tuned LoRA adapters on a single GPU cluster without worrying about fragmentation, enabling per‑customer model customization at scale.
- Portability: Because the predictor and memory manager are lightweight, they can be integrated into other serverless runtimes (AWS Lambda, Google Cloud Functions) or even on‑premise inference gateways.
Limitations & Future Work
- Predictor horizon: The LSTM is tuned for short‑term forecasts (seconds). Longer‑term workload shifts (e.g., diurnal patterns) may still cause occasional cold starts.
- Static page size: A fixed page granularity simplifies allocation but may be sub‑optimal for extremely large adapters; adaptive page sizing could further reduce fragmentation.
- Hardware dependence: The current implementation assumes a single‑GPU node; extending the scheme to multi‑GPU or heterogeneous accelerator clusters (TPU, Habana) is left for future research.
- Security considerations: Prefetching adapters across tenants raises isolation questions; the authors note the need for sandboxed memory regions to prevent cross‑tenant leakage.
Overall, Predictive‑LoRA demonstrates that a blend of lightweight traffic prediction and OS‑inspired memory management can make serverless LLM inference both faster and more resource‑efficient—an encouraging step toward truly elastic, on‑demand AI services.
Authors
- Yinan Ni
- Xiao Yang
- Yuqi Tang
- Zhimin Qiu
- Chen Wang
- Tingzhou Yuan
Paper Information
- arXiv ID: 2512.20210v1
- Categories: cs.DC
- Published: December 23, 2025