[Paper] PromptTuner: SLO-Aware Elastic System for LLM Prompt Tuning
Source: arXiv - 2603.05087v1
Overview
PromptTuner tackles a growing pain point for companies that offer Prompt‑Tuning‑as‑a‑Service on large language models (LLMs). While users care about hitting their Service Level Objectives (SLOs)—e.g., “finish tuning within 30 minutes”—providers also need to keep cloud‑compute bills low. Existing deep‑learning resource managers fall short for this specific workload. PromptTuner introduces a two‑pronged system that (1) picks smart “starter” prompts to speed up convergence and (2) dynamically schedules compute resources to meet SLOs while trimming waste.
Key Contributions
- Prompt Bank: A curated repository of high‑quality initial prompts that dramatically reduce the number of tuning iterations needed for a new downstream task.
- SLO‑aware Workload Scheduler: An elastic allocation engine that scales GPU/CPU resources up or down in real time based on the current tuning progress and the user’s deadline.
- End‑to‑end prototype: Integrated into a realistic Prompt‑Tuning‑as‑a‑Service stack and evaluated against two production‑grade baselines (INFless and ElasticFlow).
- Quantitative gains: Demonstrates 4.0–7.9× fewer SLO violations and 1.6–4.5× lower resource cost compared with the baselines.
Methodology
- Characterization Study – The authors first measured how existing resource managers handle prompt‑tuning jobs (e.g., batch size scaling, auto‑scaling policies) and identified mismatches with SLO‑driven goals.
- Prompt Bank Construction – They mined a large corpus of successful prompts from prior tuning runs, clustering them by task similarity and ranking them by convergence speed. When a new tuning request arrives, the system selects the top‑k candidates as warm‑starts.
- Elastic Scheduler Design – The scheduler continuously monitors two signals: (a) training loss convergence (how quickly the prompt is learning) and (b) time‑to‑deadline (remaining SLO budget). It then decides whether to add more GPUs, throttle the batch size, or pause/resume jobs to keep the deadline on track while avoiding over‑provisioning.
- Evaluation Setup – Experiments were run on a cluster of NVIDIA A100 GPUs across several benchmark downstream tasks (e.g., sentiment analysis, question answering). The authors compared PromptTuner against INFless (a latency‑focused elastic system) and ElasticFlow (a cost‑aware scheduler) using identical workloads and SLO settings.
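The Prompt Bank's warm-start selection (cluster by task similarity, rank by convergence speed, return the top-k candidates) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `embedding` and `epochs_to_converge` fields, the cosine-similarity matching, and the toy bank entries are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_warm_starts(task_emb, bank, k=2):
    """Rank stored prompts by similarity to the new task's embedding;
    break ties by historical convergence speed (fewer epochs first)."""
    ranked = sorted(
        bank,
        key=lambda e: (-cosine(task_emb, e["embedding"]), e["epochs_to_converge"]),
    )
    return [e["prompt"] for e in ranked[:k]]

# Toy bank: prompts mined from prior tuning runs (embeddings are 2-D for clarity).
bank = [
    {"prompt": "Classify the sentiment of: {text}",
     "embedding": [1.0, 0.1], "epochs_to_converge": 12},
    {"prompt": "Answer the question: {q}",
     "embedding": [0.1, 1.0], "epochs_to_converge": 20},
    {"prompt": "Is this review positive? {text}",
     "embedding": [0.9, 0.2], "epochs_to_converge": 9},
]

# A new sentiment-analysis request gets sentiment-style warm starts.
print(select_warm_starts([1.0, 0.0], bank, k=2))
```

In a real deployment the embeddings would come from a task-description encoder and the bank would be clustered offline; the ranking logic stays the same.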
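The scheduler's core decision loop — compare projected time-to-finish against the remaining SLO budget, then scale up, scale down, or hold — can be approximated with a rule-based sketch. The linear progress extrapolation, the doubling/halving policy, and the `max_gpus` cap are simplifying assumptions for illustration; the paper's policy additionally throttles batch size and can pause/resume jobs.

```python
def decide_gpus(progress, elapsed_s, deadline_s, gpus, max_gpus=8):
    """Return the next GPU allocation for a tuning job.

    progress   -- fraction of convergence reached so far, in (0, 1]
    elapsed_s  -- wall-clock seconds spent so far
    deadline_s -- total SLO budget in seconds
    """
    if progress <= 0:
        return gpus  # no convergence signal yet; hold steady

    # Naive linear extrapolation of total job duration.
    projected_total_s = elapsed_s / progress

    if projected_total_s > deadline_s and gpus < max_gpus:
        return min(max_gpus, gpus * 2)   # deadline at risk: scale up
    if projected_total_s < 0.5 * deadline_s and gpus > 1:
        return max(1, gpus // 2)         # well ahead of SLO: release GPUs
    return gpus                          # on track: avoid churn
```

For example, a job 25% converged after 15 of its 30 allotted minutes projects to finish late, so the allocation doubles; a job 50% converged after 5 minutes is over-provisioned and gives GPUs back.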
Results & Findings
| Metric | vs. INFless | vs. ElasticFlow |
|---|---|---|
| SLO violations | 4.0× fewer | 7.9× fewer |
| Compute cost | 1.6× lower | 4.5× lower |
| Convergence epochs | ~30% fewer on average (Prompt Bank warm starts) | — |
| Elastic scaling latency | < 5 s to attach an additional GPU | — |
These numbers show that a smarter initialization (Prompt Bank) plus a deadline‑aware scaling policy can both accelerate training and cut cloud bills dramatically.
Practical Implications
- For SaaS providers: PromptTuner can be dropped into existing Prompt‑Tuning‑as‑a‑Service platforms to meet tighter SLAs without over‑provisioning hardware, directly improving profit margins.
- For DevOps teams: The scheduler’s policy logic can be adapted to other iterative ML workloads (e.g., fine‑tuning, hyper‑parameter search) where time‑to‑solution is a hard constraint.
- For developers building custom LLM applications: Access to a Prompt Bank means you can start with a “good enough” prompt out‑of‑the‑box, reducing the trial‑and‑error loop and speeding up prototyping.
- Cloud cost optimization: By scaling resources only when the convergence curve indicates it’s needed, organizations can avoid the typical “always‑on” over‑provisioning pattern that inflates GPU spend.
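The cost argument above reduces to simple GPU-hour accounting: an always-on allocation pays for peak capacity the whole run, while elastic allocation bursts only when the convergence curve demands it. The segment durations below are made-up numbers chosen to illustrate the pattern, not figures from the paper.

```python
def gpu_hours(allocation):
    """Total GPU-hours for a job given (gpus, hours) segments."""
    return sum(gpus * hours for gpus, hours in allocation)

# Always-on: hold 8 GPUs for the full 2-hour job.
always_on = [(8, 2.0)]

# Elastic: run lean, burst to 8 GPUs only for a 30-minute crunch.
elastic = [(2, 1.0), (8, 0.5), (2, 0.5)]

print(gpu_hours(always_on))  # 16.0 GPU-hours
print(gpu_hours(elastic))    # 7.0 GPU-hours (~2.3x cheaper here)
```

The same accounting applies whether the unit price is per-GPU-hour on a public cloud or amortized hardware cost on-prem.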
Limitations & Future Work
- Prompt Bank Generality – The current bank is built from a fixed set of tasks; its effectiveness may drop for highly novel domains where similar prompts are scarce.
- Scheduler Overhead – While the scaling latency is low, the system assumes near‑instantaneous GPU provisioning, which may not hold in multi‑tenant public clouds with queue times.
- Multi‑tenant Interference – The study focuses on a single‑tenant scenario; future work could explore fairness and interference when many users share the same elastic pool.
- Extending Beyond Prompt Tuning – The authors suggest adapting the elastic scheduler to full‑model fine‑tuning or RL‑based instruction tuning as a next step.
Authors
- Wei Gao
- Peng Sun
- Dmitrii Ustiugov
- Tianwei Zhang
- Yonggang Wen
Paper Information
- arXiv ID: 2603.05087v1
- Categories: cs.DC
- Published: March 5, 2026