[Paper] An SLO-Driven and Cost-Aware Autoscaling Framework for Kubernetes
Source: arXiv - 2512.23415v1
Overview
The paper presents a new autoscaling framework for Kubernetes that puts Service Level Objectives (SLOs) and cost efficiency at the forefront. By blending lightweight demand forecasting with AIOps‑style multi‑signal control, the authors show how cloud‑native workloads can be scaled more proactively, safely, and transparently than with the built‑in Horizontal/Vertical Pod Autoscalers.
Key Contributions
- Gap‑driven analysis of existing Kubernetes autoscalers, pinpointing why they often miss SLO targets or overspend on resources.
- A safe, explainable multi‑signal autoscaling loop that consumes both infrastructure metrics (CPU, memory) and application‑level signals (latency, request rates).
- Integrated SLO‑aware and cost‑aware controller that balances performance guarantees against a budget constraint.
- Lightweight demand forecasting module (using simple time‑series techniques) that feeds the controller with short‑term workload predictions.
- Extensive experimental evaluation on microservice and event‑driven benchmarks, demonstrating up to 31 % reduction in SLO violation time, 24 % faster scaling response, and 18 % lower infrastructure cost versus tuned Kubernetes defaults.
Methodology
- Signal Collection – The framework taps into Prometheus‑style metrics from the cluster (CPU, memory, pod counts) and application‑level KPIs (e.g., 95th‑percentile latency).
- Demand Forecasting – A lightweight ARIMA/Exponential Smoothing model predicts request volume for the next few minutes, avoiding heavyweight ML pipelines.
- Control Engine – A rule‑based, safety‑checked controller evaluates three constraints before acting (a minimal sketch follows this list):
  - SLO feasibility – can the predicted load be served within the latency budget?
  - Cost budget – does the proposed scale‑up stay under the cost ceiling?
  - Stability guardrails – minimum/maximum replica counts and cooldown periods.
  The controller then emits scaling actions to the Kubernetes API (HPA/VPA or custom pod‑scale CRDs).
- Explainability Layer – Every scaling decision is logged with the contributing signals and the reasoning path, enabling operators to audit and debug the behavior.
- Evaluation Setup – The authors deployed two representative workloads: a classic microservice e‑commerce stack and an event‑driven order‑processing pipeline, each subjected to bursty, periodic, and random traffic patterns. Baselines included the default HPA, a tuned HPA, and a combination of HPA+VPA.
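The paper itself does not include code, so the following Python sketch only illustrates how the forecast‑then‑decide loop described above could be wired together. Everything here is an assumption for illustration: the capacity model (requests per replica), the cost figures, the cooldown window, and all function names are hypothetical, and single exponential smoothing stands in for the paper's lightweight ARIMA/exponential‑smoothing forecaster.

```python
# Minimal sketch of a forecast-then-decide scaling loop in the spirit of the paper.
# All names, thresholds, and cost figures are illustrative assumptions, not the
# authors' implementation. The three gates mirror the SLO, cost, and stability
# constraints listed above; the final dict plays the role of the explainability log.
import math
import time
from dataclasses import dataclass


def smooth_forecast(history, alpha=0.5):
    """Single exponential smoothing over recent request rates (req/s)."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


@dataclass
class Policy:
    per_replica_rps: float = 50.0        # assumed capacity that keeps p95 within the SLO
    cost_per_replica_hour: float = 0.04  # assumed $/replica-hour
    cost_ceiling_per_hour: float = 2.0   # budget cap for the service
    min_replicas: int = 2
    max_replicas: int = 40
    cooldown_s: int = 120


def decide(history_rps, current_replicas, last_change_ts, policy, now=None):
    """Return (target_replicas, explanation) for one control-loop tick."""
    now = now or time.time()
    predicted_rps = smooth_forecast(history_rps)

    # SLO feasibility: replicas needed so the predicted load fits the latency budget.
    needed = math.ceil(predicted_rps / policy.per_replica_rps)

    # Cost budget: never propose more replicas than the hourly ceiling allows.
    affordable = int(policy.cost_ceiling_per_hour / policy.cost_per_replica_hour)

    # Stability guardrails: clamp to min/max and respect the cooldown window.
    target = max(policy.min_replicas, min(needed, affordable, policy.max_replicas))
    if now - last_change_ts < policy.cooldown_s and target != current_replicas:
        target = current_replicas

    # Explainability: record the contributing signals and the reasoning path.
    explanation = {
        "predicted_rps": round(predicted_rps, 1),
        "slo_required_replicas": needed,
        "cost_capped_replicas": affordable,
        "final_target": target,
        "reason": ("cooldown"
                   if target == current_replicas and needed != current_replicas
                   else "slo_and_cost_balance"),
    }
    return target, explanation


if __name__ == "__main__":
    target, why = decide(
        history_rps=[180, 210, 260, 330],   # recent per-minute request rates
        current_replicas=5,
        last_change_ts=time.time() - 300,
        policy=Policy(),
    )
    print(target, why)
```

In a real deployment, `history_rps` would come from a Prometheus query (e.g., a `rate()` over a request counter) and the returned target would be applied by patching an HPA object or a custom scaling resource, as the paper's controller does.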
Results & Findings
| Metric | Baseline (tuned HPA) | Proposed Framework |
|---|---|---|
| SLO violation duration | 1.00 × (baseline) | ↓ 31 % |
| Scaling response time (time to reach target replicas) | 120 s avg | ↓ 24 % (≈ 91 s) |
| Infrastructure cost (CPU‑hour equivalents) | 1.00 × | ↓ 18 % |
| Control stability (frequency of thrashing) | occasional oscillations | no thrashing observed (guardrails enforced) |
The results indicate that by anticipating demand and explicitly weighing cost, the system can provision just enough pods before latency spikes hit, while avoiding unnecessary over‑provisioning during idle periods.
Practical Implications
- For DevOps teams: The framework can be packaged as a Helm chart or Operator, giving operators a single source of truth for both performance and budget compliance.
- For developers: Exposing application‑level metrics (e.g., latency percentiles) becomes a first‑class input to the scaling loop, encouraging better observability practices (a minimal example follows this list).
- Cost‑aware cloud budgeting: Enterprises can enforce per‑namespace or per‑service cost caps directly in the autoscaling policy, reducing surprise spend on cloud bills.
- Safety & auditability: The explainability logs support compliance requirements (e.g., SOC 2) by recording why each scaling decision was made, something that is often missing from native HPA/VPA.
- Portability: Because the forecasting component is lightweight, the approach works on edge clusters or on‑prem Kubernetes installations where heavyweight AI services are impractical.
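As a concrete illustration of the developer‑facing point above, the snippet below exposes a request‑latency histogram with the standard `prometheus_client` library; the metric name, bucket boundaries, and port are assumptions for the example, not something prescribed by the paper.

```python
# Hypothetical example of exposing an application-level latency signal that an
# SLO-aware autoscaler could consume. Metric name, buckets, and port are assumed.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "checkout_request_duration_seconds",
    "Latency of checkout requests",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

def handle_request():
    # Time the request body; a controller can later derive p95 from the
    # histogram buckets on the Prometheus side.
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.3))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request()
```

An SLO‑aware controller (or an operator dashboard) could then derive the 95th‑percentile latency with a PromQL query such as `histogram_quantile(0.95, sum(rate(checkout_request_duration_seconds_bucket[5m])) by (le))`.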
Limitations & Future Work
- Forecasting simplicity – The current time‑series models may struggle with highly irregular spikes (e.g., flash crowds); integrating more sophisticated ML predictors is a planned extension.
- Scope of signals – The study focuses on CPU, memory, and latency; incorporating custom business KPIs (e.g., queue depth, error rates) could further refine decisions.
- Multi‑cluster coordination – The framework operates within a single cluster; extending it to orchestrate scaling across a fleet of clusters (e.g., for geo‑distribution) remains open.
- Operator overhead – While the authors report modest CPU overhead (< 5 % of a node), real‑world large‑scale deployments will need thorough performance profiling.
Bottom line: By marrying SLO‑first thinking with cost awareness and transparent control, this research offers a pragmatic path for organizations to make Kubernetes autoscaling both reliable and budget‑friendly—a win‑win for developers, operators, and finance alike.
Authors
- Vinoth Punniyamoorthy
- Bikesh Kumar
- Sumit Saha
- Lokesh Butra
- Mayilsamy Palanigounder
- Akash Kumar Agarwal
- Kabilan Kannan
Paper Information
- arXiv ID: 2512.23415v1
- Categories: cs.SE, cs.DC
- Published: December 29, 2025