[Paper] Assessing Redundancy Strategies to Improve Availability in Virtualized System Architectures

Published: November 25, 2025 at 02:16 PM EST

Source: arXiv - 2511.20780v1

Overview

The paper by Silva and Callou tackles a practical problem for anyone running private‑cloud storage services: how to keep a file server running despite hardware or software failures. By modeling a Nextcloud deployment on Apache CloudStack with Stochastic Petri Nets (SPNs), the authors quantify the availability gains of different redundancy schemes and give cloud operators a data‑driven way to choose an architecture.

Key Contributions

A systematic SPN‑based methodology for evaluating availability of virtualized storage services.
Four concrete architectural models (baseline, host‑level redundancy, VM‑level redundancy, and combined host + VM redundancy).
Quantitative comparison of expected downtime and availability percentages for each model.
Guidelines for private‑cloud designers on where to invest redundancy resources for maximal impact.

Methodology

Scenario definition – The authors set up a private cloud using Apache CloudStack and deployed a Nextcloud file server on top of it.
Failure/repair modeling – Each component (physical host, hypervisor, VM, network link, storage) is represented as a place in a Stochastic Petri Net, with exponential rates for failure and repair derived from realistic hardware statistics.
Redundancy configurations – Four SPN models are built:
- Baseline: single host, single VM.
- Host‑level: two physical hosts running the same VM (active‑passive failover).
- VM‑level: two VMs on the same host with load‑balancing.
- Combined: two hosts each running a redundant VM (active‑active).
Analysis – Using standard SPN solution techniques (steady‑state probability calculation), the authors compute the availability (probability the service is up) and expected downtime per year for each configuration.

The approach stays high‑level enough for engineers to replicate: you need only failure and repair rates for each component and a Petri‑net solver (many open‑source tools exist).
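As a rough illustration of the kind of calculation involved, the sketch below swaps the SPN for a much simpler series/parallel availability model. It is not the authors' model. The mapping of the four configurations onto series and parallel blocks is one possible reading of the descriptions above, and the MTTF/MTTR values and function names are assumptions chosen for illustration. It also assumes independent failures and instantaneous failover, which an SPN would not need to assume.

```python
from math import prod

def availability(mttf_h, mttr_h):
    """Steady-state availability of one repairable component with
    exponential failure and repair times: MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)

def series(*parts):
    """Every part must be up for the block to be up."""
    return prod(parts)

def parallel(*parts):
    """At least one part must be up (independent failures assumed)."""
    return 1.0 - prod(1.0 - a for a in parts)

# Hypothetical MTTF/MTTR values in hours (NOT taken from the paper).
host = availability(mttf_h=8760.0, mttr_h=8.0)
vm = availability(mttf_h=2880.0, mttr_h=0.5)

# Assumed mapping of the four configurations onto series/parallel blocks.
models = {
    "baseline": series(host, vm),                                # one host, one VM
    "host_level": series(parallel(host, host), vm),              # redundant hosts, one VM
    "vm_level": series(host, parallel(vm, vm)),                  # one host, redundant VMs
    "combined": parallel(series(host, vm), series(host, vm)),    # two full host+VM chains
}

for name, a in models.items():
    print(f"{name:<11} availability = {a:.6f}")
```

With these assumed inputs the ordering is baseline, VM-level, host-level, combined, which matches the ordering reported in the results. The absolute values depend entirely on the assumed rates.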

Results & Findings

Configuration	Availability (≈)	Expected Downtime / yr
Baseline	99.5 %	~44 h
Host‑level	99.9 %	~8 h
VM‑level	99.8 %	~12 h
Combined	99.99 %	~0.9 h

Host‑level redundancy yields the biggest single‑step improvement because it eliminates a single point of failure (the physical server).
VM‑level redundancy also helps, but its benefit is capped by the underlying host’s reliability.
Combining both pushes availability into “four‑nines” territory (99.99 %), cutting downtime by more than an order of magnitude compared with the baseline.

The numbers are illustrative; exact percentages will vary with hardware MTBF/MTTR, but the relative ordering holds across a wide range of realistic parameters.
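The relationship between the two table columns is simple arithmetic: expected downtime is the unavailable fraction multiplied by the hours in a year. A minimal sketch, assuming an 8760-hour year and using the rounded availability percentages from the table (the function name is hypothetical). Because those percentages are rounded, the computed hours need not match the table's downtime column exactly.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def annual_downtime_hours(availability_pct):
    """Expected downtime per year, in hours, for an availability given in percent."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR

# Rounded availability percentages as listed in the table above.
table = [
    ("Baseline", 99.5),
    ("Host-level", 99.9),
    ("VM-level", 99.8),
    ("Combined", 99.99),
]

for name, pct in table:
    print(f"{name:<11} {pct:>6}%  {annual_downtime_hours(pct):6.2f} h/yr")
```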

Practical Implications

Design decisions – Cloud architects can now justify the cost of an extra host or an extra VM with concrete availability ROI numbers.
SLA negotiations – Service providers can back up availability claims (for example, “four nines”) with a reproducible model rather than vague best‑practice statements.
Capacity planning – Knowing the expected downtime helps IT budgeting (e.g., estimating lost productivity or compensation for downtime).
Tooling – The SPN framework can be integrated into CI pipelines: after any infrastructure change, automatically re‑run the model to verify that availability targets are still met.
Open‑source friendliness – Since the study uses Nextcloud and Apache CloudStack—both free projects—small‑to‑medium enterprises can adopt the same methodology without licensing hurdles.
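The CI idea under "Tooling" could look roughly like the sketch below: a small gate that compares model output against an availability target and produces an exit code for the pipeline. The function names, the result values, and the target are all hypothetical. How the availabilities are obtained from an SPN solver is left out, since that depends on the tool used.

```python
def failing_configs(results, target):
    """Return (name, availability) pairs that fall below the target."""
    return [(name, a) for name, a in sorted(results.items()) if a < target]

def gate(results, target):
    """Print a short report and return a process exit code (0 = pass, 1 = fail)."""
    failures = failing_configs(results, target)
    for name, a in failures:
        print(f"FAIL {name}: availability {a:.5f} is below target {target:.5f}")
    if not failures:
        print(f"OK: all configurations meet target {target:.5f}")
    return 1 if failures else 0

# Hypothetical model output; a real pipeline would load this from the solver's results.
model_results = {"combined": 0.9999, "host_level": 0.999}
exit_code = gate(model_results, target=0.9995)
```

A CI job would pass `exit_code` to `sys.exit` so that a missed target fails the build.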

Limitations & Future Work

Simplified failure distributions – The model assumes exponential failure/repair times; real hardware may exhibit Weibull or log‑normal behavior.
Scope of components – Network switches, storage back‑ends, and external dependencies (DNS, authentication services) are abstracted away, potentially under‑estimating failure modes.
Performance impact – The study focuses on availability, not on latency or throughput penalties introduced by redundancy mechanisms.

Future research could extend the SPN models to incorporate non‑exponential failure data, evaluate performance trade‑offs, and explore automated optimization (e.g., selecting the cheapest redundancy mix that meets a target SLA).
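The optimization idea mentioned above can be sketched as a simple search over candidate configurations. This is an illustration, not something taken from the paper: the availabilities are the rounded figures from the results table, while the annual costs and all names are invented for the example.

```python
from typing import NamedTuple

class Option(NamedTuple):
    name: str
    availability: float  # fraction of time the service is up
    annual_cost: float   # hypothetical cost units, not from the paper

def cheapest_meeting_sla(options, target):
    """Return the lowest-cost option whose availability meets the target, or None."""
    feasible = [o for o in options if o.availability >= target]
    return min(feasible, key=lambda o: o.annual_cost) if feasible else None

options = [
    Option("baseline", 0.995, 1000.0),
    Option("vm_level", 0.998, 1300.0),
    Option("host_level", 0.999, 2000.0),
    Option("combined", 0.9999, 2600.0),
]

choice = cheapest_meeting_sla(options, target=0.999)
print(choice.name if choice else "no configuration meets the target")
```

With only four candidates an exhaustive scan is enough. A larger design space (more hosts, VMs, or repair policies) would need the availability of each candidate to come from the model itself rather than a fixed list.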

Authors

Alison Silva
Gustavo Callou

Paper Information

arXiv ID: 2511.20780v1
Categories: cs.DC
Published: November 25, 2025
