[Paper] SpotVista: Availability-Aware Recommendation System for Reliable and Cost-Efficient Multi-Node Spot Instances

Published: April 27, 2026 at 10:41 AM EDT

Source: arXiv - 2604.24548v1

Overview

SpotVista tackles a pressing problem for anyone running large‑scale workloads on public clouds: how to keep multi‑node spot fleets both cheap and reliable. By mining the latest “instant availability” feeds that cloud providers now expose, the authors devise a recommendation engine that picks the most cost‑effective combination of spot instances across regions and instance types, while explicitly accounting for the risk of simultaneous interruptions.

Key Contributions

Large‑scale multi‑node availability dataset – built by working around API query limits to capture real‑time spot‑instance availability across dozens of regions.
Empirical analysis of multi‑node spot behavior – reveals how interruption patterns differ from single‑node cases and why naïve extensions of existing models fail.
Availability‑aware recommendation algorithm – jointly optimizes for cost and expected uptime, producing a “resource pool” rather than a single instance type.
Extensive real‑world validation – over 1,000 interruption experiments show SpotVista beats the prior state‑of‑the‑art (SpotVerse) and AWS SpotFleet on both stability and cost.
Open‑source tooling – the data collection pipeline and recommendation engine are released for reproducibility and community extension.

Methodology

Data Harvesting – The team continuously polls the public “instant availability” endpoints (e.g., AWS’s DescribeSpotInstanceRequests with the new InstanceAvailability flag) across multiple regions. To stay within the vendor‑imposed request caps, they stagger queries, cache results, and aggregate at the granularity of instance families rather than individual IDs.
Availability Modeling – Using the collected time‑series, they compute per‑region, per‑instance‑type interruption probabilities and, crucially, the joint probability that all nodes in a multi‑node job are interrupted simultaneously. This is modeled with a copula‑based approach that captures correlation between nodes in the same zone or across zones.
Cost‑Benefit Optimization – For a user‑specified workload (e.g., 8 vCPU, 32 GB RAM across 4 nodes), SpotVista enumerates feasible instance‑type combinations, estimates expected hourly cost (spot price × usage) and expected availability (1 – joint interruption probability), then selects the Pareto‑optimal set that maximizes availability under a cost budget (or vice‑versa).
Recommendation Delivery – The final output is a “resource pool” – a list of instance types, counts, and regions – that can be fed directly into orchestration tools like Kubernetes Cluster Autoscaler or AWS Spot Fleet.
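
The optimization and delivery steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Pool` dataclass, the `dominates`, `pareto_front`, and `recommend` helpers, and every price and probability are hypothetical, and the joint interruption probability is taken as a given input rather than derived from the copula model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pool:
    """A hypothetical candidate resource pool (all values are made up)."""
    name: str
    hourly_cost: float           # expected USD per hour for the whole pool
    joint_interrupt_prob: float  # P(all nodes interrupted at once), assumed given

    @property
    def availability(self) -> float:
        # Expected availability = 1 - joint interruption probability.
        return 1.0 - self.joint_interrupt_prob


def dominates(a: Pool, b: Pool) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.hourly_cost <= b.hourly_cost and a.availability >= b.availability
    better = a.hourly_cost < b.hourly_cost or a.availability > b.availability
    return no_worse and better


def pareto_front(pools: List[Pool]) -> List[Pool]:
    """Keep only the pools that no other pool dominates."""
    return [p for p in pools if not any(dominates(q, p) for q in pools)]


def recommend(pools: List[Pool], budget: float) -> Optional[Pool]:
    """Pick the most available Pareto-optimal pool within the hourly budget."""
    affordable = [p for p in pareto_front(pools) if p.hourly_cost <= budget]
    return max(affordable, key=lambda p: p.availability, default=None)


candidates = [
    Pool("pool-A: one type, one zone", 0.60, 0.050),
    Pool("pool-B: one type, two zones", 0.64, 0.010),
    Pool("pool-C: two types, two regions", 0.70, 0.002),
    Pool("pool-D: one type, one zone", 0.80, 0.040),  # pricier and less available than B
]

front = pareto_front(candidates)
choice = recommend(candidates, budget=0.65)
print([p.name for p in front])
print(choice.name if choice else "no pool fits the budget")
```

In this toy input, pool-D drops out because pool-B is both cheaper and more available, and a budget of 0.65 USD per hour selects pool-B. Swapping the roles (minimizing cost subject to an availability floor) would only change the filter and the key in `recommend`.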

Results & Findings

Metric	SpotVista vs. SpotVerse	SpotVista vs. AWS SpotFleet
Availability improvement	+81.28 % (multi‑region workloads)	+21.6 %
Cost savings	+2.84 %	+26.3 %
Mean time between interruptions (MTBI)	4.7 × longer	3.2 × longer
Recommendation latency	< 2 seconds per query	— (offline)

Key takeaways

Multi‑node spot availability is not a simple product of single‑node probabilities; correlated failures (e.g., zone‑wide revocations) dominate.
By explicitly modeling these correlations, SpotVista can avoid “all‑eggs‑in‑one‑basket” configurations that look cheap on paper but are fragile in practice.
Relative to SpotVerse, the cost difference is small (the table reports 2.84 % savings) while the availability gain is large, making the approach attractive for latency‑sensitive services.
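
To make the first takeaway concrete, here is a small worked sketch. It uses a simple common-shock model (a zone-wide event that interrupts every node, plus independent per-node interruptions) in place of the copula-based model the paper describes; the function names and all probabilities are assumptions chosen for illustration.

```python
def marginal_prob(zone_shock: float, node_fail: float) -> float:
    """P(a single node is interrupted): zone-wide shock OR its own failure."""
    return zone_shock + (1.0 - zone_shock) * node_fail


def joint_independent(p_single: float, n_nodes: int) -> float:
    """Naive estimate: treat nodes as independent and multiply the marginals."""
    return p_single ** n_nodes


def joint_common_shock(zone_shock: float, node_fail: float, n_nodes: int) -> float:
    """P(all nodes interrupted) when one zone-wide shock can take them all down."""
    return zone_shock + (1.0 - zone_shock) * node_fail ** n_nodes


zone_shock, node_fail, n_nodes = 0.05, 0.05, 4  # hypothetical values
p = marginal_prob(zone_shock, node_fail)
naive = joint_independent(p, n_nodes)
correlated = joint_common_shock(zone_shock, node_fail, n_nodes)

print(f"single-node interruption probability: {p:.4f}")
print(f"joint, assuming independence:         {naive:.6f}")
print(f"joint, with zone-wide shock:          {correlated:.6f}")
```

With these made-up numbers, both models agree on the single-node probability, yet the independence estimate of losing all four nodes is hundreds of times smaller than the common-shock value, because the zone-wide event alone is enough to interrupt the whole job.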

Practical Implications

Kubernetes & Serverless Operators – SpotVista can feed the autoscaler with a vetted list of node groups, reducing pod evictions and improving SLA adherence.
Data‑Intensive Pipelines – Spark, Flink, or Hadoop clusters can be provisioned on a mixed‑spot pool that lowers the risk of losing every node between checkpoints, with reported compute‑cost savings of about 26 % relative to AWS SpotFleet.
CI/CD & Testing Environments – Teams can spin up large, temporary testbeds on spot fleets without fearing wholesale shutdowns mid‑run.
Multi‑Cloud Strategies – Because the methodology only requires publicly exposed availability feeds, it can be extended to GCP Preemptible VMs or Azure Spot VMs, enabling cross‑provider cost arbitrage.
Tooling Integration – The open‑source recommendation engine can be wrapped as a Terraform module or a Helm chart, letting DevOps embed cost‑availability trade‑offs directly into IaC pipelines.

Limitations & Future Work

Query Rate Constraints – Despite the clever throttling, the dataset may lag behind rapid price spikes, potentially under‑estimating interruption risk during flash‑sale events.
Static Workload Assumptions – SpotVista currently assumes a fixed resource profile; dynamic scaling patterns (e.g., autoscaling up/down) are not yet modeled.
Vendor‑Specific Features – The approach leans heavily on AWS’s instant‑availability API; adapting to providers with less granular data may require additional heuristics.
Future Directions – The authors plan to incorporate predictive pricing signals, explore reinforcement‑learning‑based recommendation loops, and broaden the system to handle heterogeneous workloads (GPU, FPGA, etc.).

Authors

Taeyoon Kim
Kyumin Kim
Kyunghwan Kim
Hayoung Kim
Seungwoo Jeong
Moohyun Song
Kyungyong Lee

Paper Information

arXiv ID: 2604.24548v1
Categories: cs.DC
Published: April 27, 2026
