[Paper] SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning
Source: arXiv - 2601.22397v1
Overview
The paper introduces SAIR, a novel autoscaling system for multi‑stage machine‑learning inference pipelines. By harnessing a large language model (LLM) as an in‑context reinforcement‑learning controller, SAIR can dynamically adjust compute resources without any offline training, delivering dramatically lower tail latency and cost for real‑world serving workloads.
Key Contributions
- LLM‑based in‑context RL controller – uses prompt‑engineered interaction histories to improve scaling policies on‑the‑fly, avoiding costly gradient updates.
- Pareto‑dominance reward shaping with a provable separation margin, enabling the controller to focus on actions that truly improve latency‑cost trade‑offs.
- Surprisal‑guided experience retrieval – selects the most informative past episodes for the LLM context, keeping prompts short while preserving decision quality.
- Fine‑grained GPU rate control via user‑space CUDA interception, allowing the system to throttle GPU throughput at sub‑process granularity.
- Theoretical regret analysis that decomposes error into retrieval coverage and LLM selection components, offering formal insight into performance bounds.
- Extensive empirical validation on four production‑grade inference pipelines (e.g., vision transformer, speech‑to‑text) under three realistic traffic patterns, achieving up to 50 % lower P99 latency and 97 % lower effective cost compared with state‑of‑the‑art autoscalers.
Methodology
- Problem framing – Autoscaling is modeled as a sequential decision problem: at each time step the controller selects a scaling action (e.g., add/remove GPU workers, adjust per-GPU rate) for each pipeline stage. The objective is to minimize a weighted sum of tail latency (P99) and resource cost.
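The weighted objective above can be sketched in a few lines. The field names, weights, and the choice to take the pipeline P99 as the max over stages are illustrative assumptions, not details from the paper.

```python
# Schematic per-step objective: a weighted sum of tail latency and resource
# cost, which the controller tries to minimize. Weights and field names are
# illustrative placeholders.
from dataclasses import dataclass

@dataclass
class StageState:
    p99_latency_ms: float      # observed P99 latency for this stage
    gpu_cost_per_step: float   # GPU cost accrued during this control step

def step_objective(stages: list[StageState],
                   w_latency: float = 1.0,
                   w_cost: float = 0.5) -> float:
    """Weighted latency + cost objective for one control step."""
    # The pipeline's tail latency is dominated by its slowest stage.
    p99 = max(s.p99_latency_ms for s in stages)
    cost = sum(s.gpu_cost_per_step for s in stages)
    return w_latency * p99 + w_cost * cost
```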
- In-context RL with an LLM – Instead of training a policy network, SAIR stores a rolling log of state-action-reward tuples. When a new decision is needed, it constructs a prompt that includes:
  - A concise description of the current pipeline state (queue lengths, GPU utilization, recent latency).
  - A handful of the most "surprising" past episodes (highest surprisal scores) that are most relevant to the current context.
  - The Pareto-dominance reward function definition.
  The LLM then generates the next scaling action as natural-language output, which is parsed back into concrete resource commands.
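The prompt-assembly and action-parsing steps above might look roughly as follows. The prompt layout and action vocabulary are assumptions for illustration; the paper describes the three ingredients but not this exact wiring.

```python
# Illustrative sketch of assembling an in-context decision prompt and parsing
# the LLM's reply back into a concrete scaling command. The layout and the
# action vocabulary (scale_up, scale_down, set_rate, hold) are assumptions.
def build_prompt(state: dict, episodes: list[dict], reward_def: str) -> str:
    lines = [
        "You are an autoscaling controller for a multi-stage ML pipeline.",
        f"Current state: {state}",
        "Most informative past episodes (state, action, reward):",
    ]
    lines += [f"  {e}" for e in episodes]
    lines += [
        f"Reward definition: {reward_def}",
        "Reply with one action: scale_up <stage>, scale_down <stage>, "
        "set_rate <stage> <fraction>, or hold.",
    ]
    return "\n".join(lines)

def parse_action(reply: str) -> tuple[str, ...]:
    """Parse the LLM's natural-language reply into a command tuple."""
    return tuple(reply.strip().split())
```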
- Reward shaping – Rewards are computed with a Pareto-dominance check: an action earns a positive reward only if it improves both latency and cost relative to the previous action; otherwise a small penalty is applied. The authors prove a separation margin guaranteeing that truly superior actions remain distinguishable even under noisy measurements.
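The dominance check above reduces to a small function. This is a minimal sketch: the margin and penalty values are placeholders, and the paper's proven separation-margin constants are not reproduced here.

```python
# Pareto-dominance reward sketch: +1 only if the new action strictly improves
# BOTH latency and cost by at least `margin`; otherwise a small penalty.
# Margin and penalty values are illustrative placeholders.
def pareto_reward(prev_latency: float, prev_cost: float,
                  new_latency: float, new_cost: float,
                  margin: float = 0.0, penalty: float = -0.1) -> float:
    improves_latency = new_latency < prev_latency - margin
    improves_cost = new_cost < prev_cost - margin
    return 1.0 if (improves_latency and improves_cost) else penalty
```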
- Experience retrieval – To keep prompts within token limits, SAIR ranks stored episodes by surprisal (how unlikely the LLM would have predicted the observed reward). High-surprisal episodes are the most informative for learning and are preferentially inserted into the prompt.
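The ranking step can be sketched as below. Using squared prediction error as the surprisal proxy is an assumption for illustration; the paper scores surprisal under the LLM itself.

```python
# Surprisal-guided retrieval sketch: rank stored episodes by how badly a
# predictive model anticipated the observed reward, and keep the top-k for
# the prompt. Squared prediction error stands in for true LLM surprisal.
def select_episodes(episodes, predict_reward, k=4):
    """episodes: iterable of dicts with 'state', 'action', 'reward' keys.
    predict_reward: callable (state, action) -> expected reward."""
    def surprisal(ep):
        return (ep["reward"] - predict_reward(ep["state"], ep["action"])) ** 2
    # Highest-surprisal episodes first, truncated to the prompt budget.
    return sorted(episodes, key=surprisal, reverse=True)[:k]
```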
- GPU rate control – A lightweight user-space library intercepts CUDA API calls (e.g., cudaMemcpy, kernel launches) and injects throttling delays, enabling the controller to fine-tune the effective throughput of each GPU without kernel-level modifications.
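As a conceptual stand-in for the CUDA interception shim, the sketch below caps a "kernel launch" rate with a simple pacing delay. In the real system the delay is injected inside the user-space interception library, not in Python; the rate value and wrapper names are illustrative.

```python
# Conceptual throttling sketch: pace "kernel launches" to a target rate by
# injecting a delay before each one, mimicking what a user-space CUDA
# interception shim would do around the real API call.
import time

class RateLimiter:
    def __init__(self, launches_per_sec: float):
        self.interval = 1.0 / launches_per_sec
        self.next_allowed = 0.0

    def throttle(self):
        """Sleep just long enough to cap the launch rate."""
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval

limiter = RateLimiter(launches_per_sec=100.0)

def launch_kernel(work):
    limiter.throttle()   # delay injected before the intercepted call
    return work()        # placeholder for the real CUDA launch
```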
- Regret analysis – The authors bound cumulative regret as the sum of (i) retrieval coverage error (the probability that the most relevant episode is omitted from the prompt) and (ii) LLM selection error (the probability that the LLM picks a sub-optimal action given the prompt). This decomposition guides system design choices such as prompt size and retrieval strategy.
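Schematically, the decomposition has the following shape; the symbols and the exact form of the bound are illustrative, as the paper's constants and precise statement are not reproduced here.

```latex
\mathrm{Regret}(T)\;\le\;
\underbrace{\sum_{t=1}^{T}\epsilon_{\mathrm{cov}}(t)}_{\text{retrieval coverage error}}
\;+\;
\underbrace{\sum_{t=1}^{T}\epsilon_{\mathrm{sel}}(t)}_{\text{LLM selection error}}
```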
Results & Findings
| Workload | Baseline P99 Latency (e.g., K8s HPA) | SAIR P99 Latency | SAIR Effective Cost* |
|---|---|---|---|
| Vision‑Transformer (steady) | 120 ms | 68 ms (‑43 %) | 0.03 × (‑97 %) |
| Speech‑to‑Text (burst) | 210 ms | 105 ms (‑50 %) | 0.07 × (‑93 %) |
| Recommendation (periodic spikes) | 180 ms | 92 ms (‑49 %) | 0.05 × (‑95 %) |
| Multi‑modal (mixed) | 250 ms | 125 ms (‑50 %) | 0.06 × (‑94 %) |
*Effective cost assumes the GPU rate‑control mechanism can proportionally reduce billed GPU time.
Additional observations
- Bottleneck detection accuracy of 86 % – SAIR correctly identifies which stage will become the latency limiter in most time windows, enabling proactive scaling.
- Zero offline training – The system starts making sensible decisions after only a few minutes of live traffic, thanks to the LLM’s pre‑trained knowledge and the reward‑shaping scheme.
- Robustness to workload patterns – Across steady, bursty, and periodic traffic, SAIR consistently matches or outperforms the best tuned static autoscaling policies.
Practical Implications
- For cloud‑native ML services – Operators can replace heavyweight custom autoscalers with a plug‑and‑play SAIR module, cutting down on engineering effort and cloud spend.
- GPU‑intensive inference – Fine‑grained rate control lets teams squeeze more inference requests per GPU without sacrificing latency, effectively “virtualizing” GPU capacity.
- Rapid prototyping – Since SAIR needs no offline RL training, new pipelines (e.g., a fresh transformer model) can be deployed and autoscaled immediately, accelerating time‑to‑market.
- Cross‑stage coordination – Traditional autoscalers treat each microservice in isolation; SAIR’s holistic view prevents “ping‑pong” effects where scaling one stage creates a new bottleneck downstream.
- Potential integration points – SAIR can be wrapped as a Kubernetes custom controller, a serverless function, or a sidecar that intercepts CUDA calls, making it adaptable to existing DevOps pipelines.
Limitations & Future Work
- Dependence on LLM prompt length – The approach is bounded by token limits; extremely long pipelines may require more aggressive summarization or hierarchical retrieval.
- GPU rate‑control assumptions – The cost savings assume that throttling translates directly into lower billing, which may not hold on all cloud providers or with spot‑instance pricing models.
- Surprisal computation overhead – Calculating surprisal for every stored episode adds modest CPU load; scaling to millions of episodes would need more efficient indexing.
- Generalization to non‑GPU resources – The current design focuses on GPU throttling; extending SAIR to CPUs, TPUs, or FPGAs would broaden its applicability.
- Safety guarantees – While the reward shaping provides a theoretical separation margin, formal verification of safety‑critical latency SLAs remains an open research direction.
Overall, SAIR showcases how large language models can serve as flexible, zero‑training controllers for complex systems, opening a promising avenue for cost‑efficient, high‑performance ML serving.
Authors
- Jianchang Su
- Yifan Zhang
- Shengkai Lin
- Shizhen Zhao
- Yusheng Zheng
- Yiwei Yang
- Wei Zhang
Paper Information
- arXiv ID: 2601.22397v1
- Categories: cs.LG, cs.DC
- Published: January 29, 2026