[Paper] Quantifying Autoscaler Vulnerabilities: An Empirical Study of Resource Misallocation Induced by Cloud Infrastructure Faults

Published: January 8, 2026 at 02:11 AM EST
4 min read
Source: arXiv - 2601.04659v1

Overview

The paper investigates a hidden but costly problem in cloud‑native systems: autoscalers can be fooled by faulty infrastructure. When hardware glitches, network hiccups, or software bugs corrupt the performance metrics that autoscalers rely on, they may allocate too many or too few resources. The study quantifies how different fault types impact both vertical (CPU/RAM) and horizontal (instance count) scaling decisions, revealing concrete cost and reliability risks for operators.

Key Contributions

  • Empirical quantification of autoscaler misbehaviour caused by four fault categories (hardware, storage, network, software).
  • Controlled simulation framework that injects realistic metric distortions into popular autoscaling policies (CPU‑based vertical scaling, request‑rate‑based horizontal scaling).
  • Cost impact analysis showing storage‑related faults can add up to $258 / month under horizontal scaling, while routing faults bias toward under‑provisioning.
  • Sensitivity comparison between vertical and horizontal scaling, highlighting that horizontal scaling is more vulnerable to transient anomalies near threshold boundaries.
  • Actionable design guidelines for building fault‑aware autoscaling policies that separate genuine workload spikes from metric artefacts.

Methodology

  1. Fault Injection Engine – The authors built a lightweight simulator that mimics typical cloud workloads (web services, batch jobs) and injects metric errors corresponding to four fault classes (a minimal illustrative sketch of this setup appears after this list):

    • Hardware: CPU throttling, memory errors.
    • Storage: I/O latency spikes, checksum failures.
    • Network: Packet loss, increased round‑trip time.
    • Software: Monitoring agent crashes, timestamp skew.
  2. Autoscaling Policies Tested

    • Vertical: CPU‑threshold based scaling (add/remove vCPU/RAM).
    • Horizontal: Request‑rate threshold based scaling (add/remove VM instances).
  3. Experiment Matrix

    • Three instance sizes (small, medium, large).
    • Three SLO thresholds (tight, moderate, relaxed).
    • Each fault type applied at three severity levels (low, medium, high).
  4. Metrics Collected

    • Provisioning decisions (scale‑up/scale‑down events).
    • Operational cost (based on cloud provider pricing).
    • SLO violation rate (percentage of requests exceeding latency target).
  5. Statistical Analysis – Repeated runs (30 per configuration) and ANOVA tests to isolate the effect of each fault type on cost and reliability.
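
The paper's simulator itself is not reproduced here, but a minimal Python sketch of the setup in steps 1–2 can make the idea concrete. Everything below is illustrative: the workload model, thresholds, cooldown, and fault magnitude are assumptions, not values from the paper. A brief storage‑latency burst inflates the observed request rate, and a simple threshold‑based horizontal scaler reacts to it.

```python
import random

random.seed(0)

def true_request_rate():
    """Hypothetical steady workload: roughly 100 req/s with mild noise."""
    return 100 + random.gauss(0, 3)

def inject_storage_fault(value, t, start=50, duration=5, spike=2.5):
    """Inflate the observed metric during a short I/O-latency burst."""
    return value * spike if start <= t < start + duration else value

class HorizontalAutoscaler:
    """Add/remove instances when the per-instance rate crosses thresholds."""
    def __init__(self, up=120, down=60, cooldown=10):
        self.up, self.down, self.cooldown = up, down, cooldown
        self.instances = 1
        self.last_action = -cooldown
        self.actions = 0

    def step(self, t, observed_rate):
        per_instance = observed_rate / self.instances
        if t - self.last_action < self.cooldown:
            return                      # still inside the cooldown window
        if per_instance > self.up:
            self.instances += 1         # scale out on an apparent overload
        elif per_instance < self.down and self.instances > 1:
            self.instances -= 1         # scale in when load looks low
        else:
            return
        self.last_action = t
        self.actions += 1

scaler = HorizontalAutoscaler()
for t in range(200):
    observed = inject_storage_fault(true_request_rate(), t)
    scaler.step(t, observed)
print(f"{scaler.actions} scaling actions for a steady workload")
```

With a steady ~100 req/s workload, the five‑step burst is enough to launch an extra instance that is only retired after the cooldown expires, mirroring the lingering over‑provisioning behaviour described in the results below.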

Results & Findings

| Fault Type | Scaling Mode | Typical Cost Overhead | SLO Impact |
|---|---|---|---|
| Storage | Horizontal | +$258 / month (≈ 22 % increase) | Minor (≤ 2 % extra latency) |
| Network (Routing) | Horizontal | +$73 / month | Consistent under‑provisioning, leading to 8 % SLO breaches |
| Hardware | Vertical | +$41 / month | Slight over‑provisioning, negligible SLO effect |
| Software (monitoring) | Both | +$19 / month | Variable, depends on detection latency |

  • Horizontal scaling reacts sharply to transient metric spikes; a brief storage latency burst can trigger unnecessary instance launches that linger for the scaling cooldown period.
  • Vertical scaling is more tolerant of short‑lived anomalies but suffers from cumulative over‑provisioning when faults persist.
  • Near threshold boundaries (e.g., 70 % CPU), even low‑severity faults cause oscillations (scale‑up → scale‑down) that waste resources; a small numerical sketch of this effect follows this list.
  • The routing anomaly consistently pushes the autoscaler to think the service is under‑loaded, resulting in chronic under‑provisioning and higher latency.
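
A toy numerical sketch of the threshold‑boundary oscillation (the load level, thresholds, and jitter range are made up for illustration, not taken from the paper): a constant workload sits right at 70 % of four vCPUs, and a few points of fault‑induced jitter are enough to make a simple vertical scaler flip up and down repeatedly.

```python
import random

random.seed(1)
load = 2.8          # steady demand, roughly 70 % of 4 vCPUs
vcpus = 4
actions = 0
for t in range(60):
    jitter = random.uniform(-0.03, 0.03)   # low-severity fault noise
    observed = load / vcpus + jitter       # utilisation as the scaler sees it
    if observed > 0.70:                    # scale-up threshold
        vcpus += 1
        actions += 1
    elif observed < 0.60 and vcpus > 4:    # scale-down threshold and floor
        vcpus -= 1
        actions += 1
print(f"{actions} scale actions in 60 steps despite a constant workload")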

Practical Implications

  1. Policy Design – Incorporate fault‑aware smoothing (e.g., exponential moving averages with fault detection windows) to dampen reaction to short‑lived metric glitches; a minimal sketch appears after this list.
  2. Metric Validation Layer – Deploy lightweight sanity checks (outlier detection, cross‑metric correlation) before feeding data to the autoscaler; see the validation sketch after this list.
  3. Hybrid Scaling Strategies – Combine vertical and horizontal scaling with complementary thresholds; vertical scaling can absorb transient spikes while horizontal scaling handles sustained load.
  4. Cost Guardrails – Set hard caps on scaling actions triggered within a short interval to prevent runaway provisioning after storage faults; see the guardrail sketch after this list.
  5. Observability Enhancements – Tag metrics with provenance (e.g., “from storage subsystem”) so operators can quickly pinpoint the fault source when autoscaling behaves oddly.
  6. SLA Negotiations – When offering cloud‑native SaaS, factor in potential autoscaler mis‑allocations into pricing models or provide “fault‑tolerant autoscaling” as a premium feature.
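
To make item 1 concrete, here is a minimal sketch of fault‑aware smoothing under simple assumptions (the class name, EMA weight, and rejection rule are illustrative, not the paper's design): an exponential moving average damps short‑lived glitches, and samples that deviate too far from the running average are held back instead of being smoothed in.

```python
class FaultAwareSmoother:
    """Smooth a metric stream while ignoring likely fault artefacts."""
    def __init__(self, alpha=0.2, reject_factor=2.0):
        self.alpha = alpha                  # EMA weight for new samples
        self.reject_factor = reject_factor  # deviations beyond this are suspect
        self.ema = None

    def update(self, sample):
        if self.ema is None:
            self.ema = sample
            return self.ema
        # Treat extreme deviations as probable metric artefacts, not real load.
        if sample > self.reject_factor * self.ema or sample < self.ema / self.reject_factor:
            return self.ema                 # hold the last smoothed value
        self.ema = self.alpha * sample + (1 - self.alpha) * self.ema
        return self.ema

smoother = FaultAwareSmoother()
readings = [100, 102, 99, 480, 101, 98]     # 480 mimics a storage-fault spike
print([round(smoother.update(r), 1) for r in readings])
```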
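
Item 2's metric validation layer could look something like the following sketch (the z‑score cutoff and the CPU‑per‑request band are assumptions): readings are checked against their own recent history and cross‑checked against a second metric before the autoscaler ever sees them.

```python
from statistics import mean, pstdev

def is_outlier(history, sample, z=3.0):
    """Flag a sample that sits far outside the recent distribution."""
    if len(history) < 10:
        return False
    mu, sigma = mean(history), pstdev(history)
    return sigma > 0 and abs(sample - mu) > z * sigma

def cross_metric_consistent(cpu_util, request_rate, max_cpu_per_req=0.02):
    """Crude cross-check: CPU per request should stay in a plausible band."""
    return request_rate <= 0 or cpu_util / request_rate <= max_cpu_per_req

def validate(history, cpu_util, request_rate):
    """Return the sample if it looks genuine, or None so the scaler skips it."""
    if is_outlier(history, cpu_util) or not cross_metric_consistent(cpu_util, request_rate):
        return None
    return cpu_util

history = [0.62, 0.65, 0.63, 0.66, 0.64, 0.61, 0.65, 0.63, 0.64, 0.62]
print(validate(history, 0.64, 100))   # plausible reading passes through
print(validate(history, 0.99, 100))   # fault-like spike is rejected -> None
```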
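
For item 4, a minimal cost‑guardrail sketch (the cap and window values are arbitrary): allow at most a fixed number of scale‑ups inside a sliding time window so that a storage‑fault burst cannot trigger runaway provisioning.

```python
import time
from collections import deque

class ScaleUpGuardrail:
    """Permit at most `max_actions` scale-ups within any `window`-second span."""
    def __init__(self, max_actions=3, window=300):
        self.max_actions = max_actions
        self.window = window
        self.recent = deque()               # timestamps of permitted scale-ups

    def allow(self, now=None):
        now = time.time() if now is None else now
        while self.recent and now - self.recent[0] > self.window:
            self.recent.popleft()           # drop actions outside the window
        if len(self.recent) >= self.max_actions:
            return False                    # cap reached: hold provisioning
        self.recent.append(now)
        return True

guard = ScaleUpGuardrail(max_actions=3, window=300)
print([guard.allow(now=t) for t in (0, 10, 20, 30, 400)])
# -> [True, True, True, False, True]: the fourth burst-driven scale-up is blocked
```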

Limitations & Future Work

  • Simulation‑only: The study uses a controlled simulator; real‑world cloud environments may exhibit more complex fault interdependencies.
  • Limited workload diversity: Experiments focus on typical web‑service patterns; batch‑oriented or event‑driven workloads could react differently.
  • Single‑provider pricing: Cost calculations assume a specific provider’s pricing; results may vary across regions or pricing models (spot, reserved).
  • Future directions proposed by the authors include:
    • Deploying the fault injection framework on a public cloud to validate findings in production.
    • Extending the analysis to machine‑learning‑based autoscalers that use predictive models rather than static thresholds.
    • Investigating cross‑fault scenarios (e.g., simultaneous storage and network faults) and their compound effects on scaling decisions.

Authors

  • Gijun Park

Paper Information

  • arXiv ID: 2601.04659v1
  • Categories: cs.DC
  • Published: January 8, 2026