[Paper] Assessing Redundancy Strategies to Improve Availability in Virtualized System Architectures

Published: November 25, 2025 at 02:16 PM EST
Source: arXiv - 2511.20780v1

Overview

The paper by Silva and Callou tackles a practical problem for anyone running private‑cloud storage services: how to keep a file server available despite hardware or software failures. By modeling a Nextcloud deployment on Apache CloudStack with Stochastic Petri Nets (SPNs), the authors quantify the availability gains of different redundancy schemes and give cloud operators a data‑driven way to choose an architecture.

Key Contributions

  • A systematic SPN‑based methodology for evaluating availability of virtualized storage services.
  • Four concrete architectural models (baseline, host‑level redundancy, VM‑level redundancy, and combined host + VM redundancy).
  • Quantitative comparison of expected downtime and availability percentages for each model.
  • Guidelines for private‑cloud designers on where to invest redundancy resources for maximal impact.

Methodology

  1. Scenario definition – The authors set up a private cloud using Apache CloudStack and deployed a Nextcloud file server on top of it.
  2. Failure/repair modeling – Each component (physical host, hypervisor, VM, network link, storage) is represented by places and timed transitions in a Stochastic Petri Net, with exponentially distributed failure and repair times whose rates are derived from realistic hardware statistics.
  3. Redundancy configurations – Four SPN models are built:
    • Baseline: single host, single VM.
    • Host‑level: two physical hosts running the same VM (active‑passive failover).
    • VM‑level: two VMs on the same host with load‑balancing.
    • Combined: two hosts each running a redundant VM (active‑active).
  4. Analysis – Using standard SPN solution techniques (steady‑state probability calculation), the authors compute the availability (probability the service is up) and expected downtime per year for each configuration.

The approach stays high‑level enough for engineers to replicate: the inputs are component failure and repair rates, plus a Petri‑net solver (many open‑source tools exist).
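
As a rough illustration of what steady‑state analysis involves, the sketch below solves a small continuous‑time Markov chain, the kind of model an SPN with exponential transitions reduces to. It is not the authors' model: the three‑state chain (number of failed units in a redundant pair, each repaired independently) and the rates LAMBDA and MU are assumptions chosen for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def steady_state(q):
    """Return pi with pi Q = 0 and sum(pi) = 1 for generator matrix q."""
    n = len(q)
    # pi Q = 0 is Q^T pi^T = 0; the last equation is replaced by normalisation.
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)


LAMBDA = 1 / 2000.0  # failures per hour (hypothetical: MTTF = 2000 h)
MU = 1 / 8.0         # repairs per hour (hypothetical: MTTR = 8 h)

# State k = number of failed units in a redundant pair.
Q = [
    [-2 * LAMBDA, 2 * LAMBDA, 0.0],
    [MU, -(MU + LAMBDA), LAMBDA],
    [0.0, 2 * MU, -2 * MU],
]

pi = steady_state(Q)
availability = pi[0] + pi[1]  # the service is up unless both units are down
print(f"availability = {availability:.8f}")
```

For this particular chain the result can be cross‑checked against the closed form 1 - p², where p = LAMBDA / (LAMBDA + MU) is the unavailability of one unit. A dedicated SPN tool generates and solves much larger chains of this kind automatically.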

Results & Findings

Configuration   Availability (≈)   Expected Downtime / yr
Baseline        99.5 %             ~44 h
Host‑level      99.9 %             ~8 h
VM‑level        99.8 %             ~12 h
Combined        99.99 %            ~0.9 h

  • Host‑level redundancy yields the biggest single‑step improvement because it eliminates a single point of failure (the physical server).
  • VM‑level redundancy also helps, but its benefit is capped by the underlying host’s reliability.
  • Combining both pushes availability into “four‑nines” territory (99.99 %), cutting downtime by more than an order of magnitude compared with the baseline.
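
The reasoning behind these bullets can be sketched with a simple series/parallel availability calculation. This is a reliability‑block‑diagram simplification rather than the authors' SPN: it ignores failover time and common‑cause failures, and the component availabilities A_HOST and A_VM are hypothetical values, so the outputs will not reproduce the reported figures.

```python
def series(*avail):
    """All components must be up: multiply availabilities."""
    result = 1.0
    for a in avail:
        result *= a
    return result


def parallel(*avail):
    """At least one component must be up: 1 - product of unavailabilities."""
    down = 1.0
    for a in avail:
        down *= 1.0 - a
    return 1.0 - down


A_HOST = 0.997  # hypothetical physical host + hypervisor availability
A_VM = 0.998    # hypothetical VM + application availability

configs = {
    "baseline": series(A_HOST, A_VM),
    # Two hosts can run the VM, but the VM itself stays in series.
    "host_level": series(parallel(A_HOST, A_HOST), A_VM),
    # Two VMs share one host, so the host stays in series.
    "vm_level": series(A_HOST, parallel(A_VM, A_VM)),
    # Two complete host + VM stacks; either one can serve requests.
    "combined": parallel(series(A_HOST, A_VM), series(A_HOST, A_VM)),
}

for name, a in configs.items():
    print(f"{name:10s} {a:.6f}")
```

In this simplified view the VM‑level result can never exceed A_HOST, because the shared host remains in series with the redundant VMs. That is the cap described in the second bullet.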

The numbers are illustrative; exact percentages will vary with hardware MTBF/MTTR, but the relative ordering holds across a wide range of realistic parameters.
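
Availability and expected yearly downtime are related by a simple conversion, sketched below. The function name and the 8760‑hour year are choices made for this example. Because the reported percentages are rounded, the computed hours match the reported downtime only approximately (99.8 % corresponds to about 17.5 h per year, for example).

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years


def downtime_hours_per_year(availability):
    """Expected yearly downtime for a steady-state availability in [0, 1]."""
    return (1.0 - availability) * HOURS_PER_YEAR


for name, pct in [("Baseline", 99.5), ("Host-level", 99.9),
                  ("VM-level", 99.8), ("Combined", 99.99)]:
    hours = downtime_hours_per_year(pct / 100.0)
    print(f"{name:10s} {pct:6.2f} %  {hours:5.1f} h/yr")
```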

Practical Implications

  • Design decisions – Cloud architects can now justify the cost of an extra host or an extra VM with concrete availability ROI numbers.
  • SLA negotiations – Service providers can back up availability claims such as “four‑nines” with a reproducible model rather than vague best‑practice statements.
  • Capacity planning – Knowing the expected downtime helps IT budgeting (e.g., estimating lost productivity or compensation for downtime).
  • Tooling – The SPN framework can be integrated into CI pipelines: after any infrastructure change, automatically re‑run the model to verify that availability targets are still met.
  • Open‑source friendliness – Since the study uses Nextcloud and Apache CloudStack—both free projects—small‑to‑medium enterprises can adopt the same methodology without licensing hurdles.
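
The tooling idea above could look roughly like the following gate script. Everything in it is hypothetical: the function names, the command‑line interface, and the way the model result reaches the script (a real pipeline would obtain it from whichever SPN solver is in use).

```python
import sys


def check_availability(computed, target):
    """Return (ok, message) comparing a model result against an SLA target."""
    if not 0.0 <= computed <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    ok = computed >= target
    verdict = "meets" if ok else "violates"
    return ok, f"availability {computed:.5f} {verdict} target {target:.5f}"


def main(argv):
    # Usage (hypothetical): python check_sla.py <computed> <target>
    computed, target = float(argv[1]), float(argv[2])
    ok, message = check_availability(computed, target)
    print(message)
    return 0 if ok else 1  # a non-zero exit code fails the CI job


# Only act as a CLI when both arguments are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

Returning a non‑zero exit code is the conventional way to make a pipeline step fail, so the script needs no CI‑specific integration.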

Limitations & Future Work

  • Simplified failure distributions – The model assumes exponential failure/repair times; real hardware may exhibit Weibull or log‑normal behavior.
  • Scope of components – Network switches, storage back‑ends, and external dependencies (DNS, authentication services) are abstracted away, potentially under‑estimating failure modes.
  • Performance impact – The study focuses on availability, not on latency or throughput penalties introduced by redundancy mechanisms.
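
One way to probe the exponential assumption is to simulate a single component as an alternating up/down process with non‑exponential times. The sketch below uses Python's `random.weibullvariate` and `random.lognormvariate`. All distribution parameters are hypothetical, and this is a Monte Carlo illustration, not the SPN analysis used in the paper.

```python
import math
import random


def simulate_availability(cycles, seed=0):
    """Estimate availability as total uptime / total time over many cycles."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        # Weibull time to failure: scale 2000 h, shape 1.5 (hypothetical).
        up += rng.weibullvariate(2000.0, 1.5)
        # Log-normal repair time with mu = 2.0, sigma = 0.5 (hypothetical).
        down += rng.lognormvariate(2.0, 0.5)
    return up / (up + down)


# Analytic means of the same two distributions, for comparison.
mean_up = 2000.0 * math.gamma(1 + 1 / 1.5)
mean_down = math.exp(2.0 + 0.5 ** 2 / 2)
analytic = mean_up / (mean_up + mean_down)

estimate = simulate_availability(100_000)
print(f"simulated {estimate:.5f}  analytic {analytic:.5f}")
```

For a single repairable component, long‑run availability depends only on the mean up and down times, so the simulated value should land close to the analytic ratio. Distribution shape may matter more once redundancy, failover, and shared repair resources interact.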

Future research could extend the SPN models to incorporate non‑exponential failure data, evaluate performance trade‑offs, and explore automated optimization (e.g., selecting the cheapest redundancy mix that meets a target SLA).
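
The optimization idea can be sketched as a simple search over candidate configurations. The availability values below are the approximate figures reported above, while the yearly costs and the function name are hypothetical placeholders.

```python
# Availability: approximate values from the results; cost: hypothetical.
OPTIONS = [
    {"name": "Baseline", "availability": 0.995, "cost": 1000},
    {"name": "VM-level", "availability": 0.998, "cost": 1300},
    {"name": "Host-level", "availability": 0.999, "cost": 2000},
    {"name": "Combined", "availability": 0.9999, "cost": 2600},
]


def cheapest_meeting(options, target):
    """Return the lowest-cost option whose availability meets the target."""
    feasible = [o for o in options if o["availability"] >= target]
    if not feasible:
        return None  # no configuration satisfies the SLA
    return min(feasible, key=lambda o: o["cost"])


choice = cheapest_meeting(OPTIONS, target=0.999)
print(choice["name"] if choice else "no feasible configuration")
```

With only four candidates an exhaustive scan is enough. A larger design space (more hosts, more VMs, different repair policies) would call for re‑running the availability model per candidate and possibly a smarter search.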

Authors

  • Alison Silva
  • Gustavo Callou

Paper Information

  • arXiv ID: 2511.20780v1
  • Categories: cs.DC
  • Published: November 25, 2025