[Paper] SpotVista: Availability-Aware Recommendation System for Reliable and Cost-Efficient Multi-Node Spot Instances

Published: April 27, 2026 at 10:41 AM EDT

Source: arXiv - 2604.24548v1

Overview

SpotVista tackles a pressing problem for anyone running large‑scale workloads on public clouds: how to keep multi‑node spot fleets both cheap and reliable. By mining the latest “instant availability” feeds that cloud providers now expose, the authors devise a recommendation engine that picks the most cost‑effective combination of spot instances across regions and instance types, while explicitly accounting for the risk of simultaneous interruptions.

Key Contributions

  • Large-scale multi-node availability dataset – built by working around API query limits (staggered, cached queries) to capture real-time spot-instance availability across dozens of regions.
  • Empirical analysis of multi‑node spot behavior – reveals how interruption patterns differ from single‑node cases and why naïve extensions of existing models fail.
  • Availability‑aware recommendation algorithm – jointly optimizes for cost and expected uptime, producing a “resource pool” rather than a single instance type.
  • Extensive real‑world validation – over 1,000 interruption experiments show SpotVista beats the prior state‑of‑the‑art (SpotVerse) and AWS SpotFleet on both stability and cost.
  • Open‑source tooling – the data collection pipeline and recommendation engine are released for reproducibility and community extension.

Methodology

  1. Data Harvesting – The team continuously polls the public “instant availability” endpoints that cloud providers expose (on AWS, for example, the Spot placement score returned by GetSpotPlacementScores) across multiple regions. To stay within the vendor-imposed request caps, they stagger queries, cache results, and aggregate at the granularity of instance families rather than individual instance IDs.
  2. Availability Modeling – Using the collected time‑series, they compute per‑region, per‑instance‑type interruption probabilities and, crucially, the joint probability that all nodes in a multi‑node job are interrupted simultaneously. This is modeled with a copula‑based approach that captures correlation between nodes in the same zone or across zones.
  3. Cost‑Benefit Optimization – For a user‑specified workload (e.g., 8 vCPU, 32 GB RAM across 4 nodes), SpotVista enumerates feasible instance‑type combinations, estimates expected hourly cost (spot price × usage) and expected availability (1 – joint interruption probability), then selects the Pareto‑optimal set that maximizes availability under a cost budget (or vice‑versa).
  4. Recommendation Delivery – The final output is a “resource pool” – a list of instance types, counts, and regions – that can be fed directly into orchestration tools like Kubernetes Cluster Autoscaler or AWS Spot Fleet.
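
The cost-benefit step (step 3) can be illustrated with a small Python sketch. This is not the authors' implementation: the `PoolCandidate` type, the pool names, and all prices and probabilities are hypothetical, and the joint interruption probability is assumed to be already estimated by the availability model. The sketch only shows the general idea of keeping the Pareto-optimal candidates on (cost, availability) and then picking the most available one within a budget.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PoolCandidate:
    """Hypothetical candidate: one mix of instance types, counts, and regions."""
    name: str
    hourly_cost: float           # expected hourly cost in USD (spot price x usage)
    joint_interrupt_prob: float  # estimated probability that all nodes are interrupted together

    @property
    def availability(self) -> float:
        # Expected availability = 1 - joint interruption probability.
        return 1.0 - self.joint_interrupt_prob


def pareto_front(cands: List[PoolCandidate]) -> List[PoolCandidate]:
    """Keep candidates that no other candidate beats on both cost and availability."""
    front = []
    for c in cands:
        dominated = any(
            o.hourly_cost <= c.hourly_cost
            and o.availability >= c.availability
            and (o.hourly_cost < c.hourly_cost or o.availability > c.availability)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.hourly_cost)


def best_under_budget(cands: List[PoolCandidate], budget: float) -> Optional[PoolCandidate]:
    """Most available Pareto-optimal candidate whose cost fits the budget, or None."""
    affordable = [c for c in pareto_front(cands) if c.hourly_cost <= budget]
    return max(affordable, key=lambda c: c.availability, default=None)


# Made-up numbers, for illustration only.
candidates = [
    PoolCandidate("pool-a", 0.40, 0.20),
    PoolCandidate("pool-b", 0.55, 0.05),
    PoolCandidate("pool-c", 0.60, 0.10),  # costs more and is less available than pool-b
    PoolCandidate("pool-d", 0.90, 0.01),
]
front = pareto_front(candidates)
choice = best_under_budget(candidates, budget=0.70)
print([c.name for c in front])
print(choice.name if choice else "no pool fits the budget")
```

With these made-up inputs, pool-c drops out because pool-b is both cheaper and more available, and the budget then decides between the remaining candidates. Swapping the roles (minimize cost subject to an availability floor) would be a small change to `best_under_budget`.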

Results & Findings

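Metric                                 | SpotVista vs. SpotVerse            | SpotVista vs. AWS SpotFleet
---------------------------------------+------------------------------------+----------------------------
Availability improvement               | +81.28 % (multi-region workloads)  | +21.6 %
Cost savings                           | +2.84 %                            | +26.3 %
Mean time between interruptions (MTBI) | 4.7× longer                        | 3.2× longer
Recommendation latency                 | < 2 seconds per query              | n/a (offline)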

Key takeaways

  • Multi‑node spot availability is not a simple product of single‑node probabilities; correlated failures (e.g., zone‑wide revocations) dominate.
  • By explicitly modeling these correlations, SpotVista can avoid “all‑eggs‑in‑one‑basket” configurations that look cheap on paper but are fragile in practice.
  • Relative to SpotVerse, the cost savings are modest (≈ 3 %) but the availability gain is large, which makes the approach attractive for latency-sensitive services.
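
To make the first takeaway concrete, here is a minimal Python sketch of why multiplying single-node probabilities can badly under-estimate the risk of losing every node at once. It uses a simple common-shock model, not the copula-based model mentioned in the methodology: each node is assumed to be interrupted either by a zone-wide event (probability `q_zone`) or on its own (probability `p_indep`). Both parameters and all numbers are hypothetical.

```python
from typing import Sequence


def marginal(p_indep: float, q_zone: float) -> float:
    """Per-node interruption probability under the common-shock assumption."""
    return q_zone + (1 - q_zone) * p_indep


def joint_naive(n: int, p_indep: float, q_zone: float) -> float:
    """Naive estimate: treat the n nodes as independent and multiply marginals."""
    return marginal(p_indep, q_zone) ** n


def joint_same_zone(n: int, p_indep: float, q_zone: float) -> float:
    """All n nodes in one zone: a zone-wide event takes them all out together."""
    return q_zone + (1 - q_zone) * p_indep ** n


def joint_spread(nodes_per_zone: Sequence[int], p_indep: float, q_zone: float) -> float:
    """Nodes spread over zones that are assumed independent of each other."""
    prob = 1.0
    for n in nodes_per_zone:
        prob *= joint_same_zone(n, p_indep, q_zone)
    return prob


# Made-up parameters: 5 % independent risk per node, 2 % zone-wide risk.
p, q = 0.05, 0.02
print("naive product, 4 nodes :", joint_naive(4, p, q))
print("same zone, 4 nodes     :", joint_same_zone(4, p, q))
print("two zones, 2 + 2 nodes :", joint_spread([2, 2], p, q))
```

Under these assumptions the same-zone probability stays near the zone-wide risk (about 0.02) no matter how many nodes are added, while the naive product suggests something on the order of 2e-5. Splitting the nodes across zones lowers the joint risk, which is the intuition behind avoiding single-zone pools.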

Practical Implications

  • Kubernetes & Serverless Operators – SpotVista can feed the autoscaler with a vetted list of node groups, reducing pod evictions and improving SLA adherence.
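  • Data-Intensive Pipelines – Spark, Flink, or Hadoop clusters can be provisioned on a mixed-spot pool chosen to make a simultaneous loss of all nodes unlikely, which protects checkpoint progress, while cutting the compute bill by roughly a quarter relative to AWS SpotFleet (26.3 % in the reported results).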
  • CI/CD & Testing Environments – Teams can spin up large, temporary testbeds on spot fleets without fearing wholesale shutdowns mid‑run.
  • Multi‑Cloud Strategies – Because the methodology only requires publicly exposed availability feeds, it can be extended to GCP Preemptible VMs or Azure Spot VMs, enabling cross‑provider cost arbitrage.
  • Tooling Integration – The open‑source recommendation engine can be wrapped as a Terraform module or a Helm chart, letting DevOps embed cost‑availability trade‑offs directly into IaC pipelines.

Limitations & Future Work

  • Query Rate Constraints – Even with throttled, cached queries, the dataset may lag behind rapid changes in spot capacity and price, potentially under-estimating interruption risk during sudden demand surges.
  • Static Workload Assumptions – SpotVista currently assumes a fixed resource profile; dynamic scaling patterns (e.g., autoscaling up/down) are not yet modeled.
  • Vendor‑Specific Features – The approach leans heavily on AWS’s instant‑availability API; adapting to providers with less granular data may require additional heuristics.
  • Future Directions – The authors plan to incorporate predictive pricing signals, explore reinforcement‑learning‑based recommendation loops, and broaden the system to handle heterogeneous workloads (GPU, FPGA, etc.).
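
The query-rate constraint above, and the stagger-and-cache strategy described in the methodology, can be sketched in a few lines of Python. This is an illustrative sketch only, not the released pipeline: `ThrottledCache`, `fake_fetch`, the one-second spacing, and the 60-second freshness window are all assumptions, and the fetch function stands in for whatever provider call returns an availability signal. It also shows where staleness comes from: any value younger than the freshness window is served from cache rather than re-queried.

```python
import time
from typing import Callable, Dict, Hashable, Optional, Tuple


class ThrottledCache:
    """Cache results per key and space upstream calls by a minimum interval."""

    def __init__(
        self,
        fetch: Callable[[Hashable], int],
        min_interval_s: float,
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval = min_interval_s
        self._ttl = ttl_s
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None
        self._cache: Dict[Hashable, Tuple[float, int]] = {}

    def get(self, key: Hashable) -> int:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no upstream call, but possibly stale data
        if self._last_call is not None:
            wait = self._min_interval - (now - self._last_call)
            if wait > 0:
                self._sleep(wait)  # stagger calls to respect the request cap
        self._last_call = self._clock()
        value = self._fetch(key)
        self._cache[key] = (self._last_call, value)
        return value


class FakeClock:
    """Deterministic clock so the demo runs without real waiting."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


calls = []


def fake_fetch(key: Hashable) -> int:
    # Hypothetical stand-in for a provider availability query.
    calls.append(key)
    return 3  # placeholder availability score


clock = FakeClock()
poller = ThrottledCache(fake_fetch, min_interval_s=1.0, ttl_s=60.0,
                        clock=clock.now, sleep=clock.sleep)
for key in [("us-east-1", "m5"), ("us-west-2", "m5"), ("us-east-1", "m5")]:
    poller.get(key)
print(len(calls), "upstream calls, simulated seconds elapsed:", clock.t)
```

In this demo the third lookup repeats an earlier key and is answered from cache, so only two upstream calls are made. A longer freshness window reduces query volume further but widens the gap between the dataset and current conditions.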

Authors

  • Taeyoon Kim
  • Kyumin Kim
  • Kyunghwan Kim
  • Hayoung Kim
  • Seungwoo Jeong
  • Moohyun Song
  • Kyungyong Lee

Paper Information

  • arXiv ID: 2604.24548v1
  • Categories: cs.DC
  • Published: April 27, 2026