Design Highly Available and/or Fault‑Tolerant Architectures

Published: (February 5, 2026 at 03:04 AM EST)
5 min read
Source: Dev.to

Exam Guide – Solutions Architect – Associate

Domain 2: Design Resilient Architectures

Task Statement 2.2 – Designing Highly Available and Fault‑Tolerant Architectures

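High Availability (HA) – the system stays available through component failures, possibly with a brief interruption while it fails over.
Fault Tolerance (FT) – the system keeps operating with no interruption when a component fails, typically through fully redundant components.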

Typical HA pattern – Multi‑AZ + load balancing + managed services + no single points of failure.
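
A rough way to see why redundancy across AZs helps is to compute availability under the assumption that failures are independent. The sketch below is illustrative only: the 0.99 figure is a made-up input, not an AWS SLA, and real failures are not always independent.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant copies where any one copy can serve traffic.

    Assumes failures are independent, which is a simplification.
    """
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*tiers: float) -> float:
    """Availability of a chain of tiers that must all be up (web -> app -> db)."""
    result = 1.0
    for availability in tiers:
        result *= availability
    return result


# Hypothetical input: one copy of a tier is up 99% of the time.
one_copy = 0.99
two_copies = parallel_availability(one_copy, 2)
print(f"one copy: {one_copy:.4f}, two copies: {two_copies:.4f}")
print(f"two tiers in series, one copy each: {serial_availability(one_copy, one_copy):.4f}")
```

Adding a second copy raises availability, while chaining tiers lowers it, which is why every critical tier needs its own redundancy.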

1️⃣ AWS Global Infrastructure

| Component | Description |
| --- | --- |
| Availability Zones (AZs) | Isolated failure domains within a Region. |
| Regions | Separate geographic areas – used for disaster‑recovery (DR). |
| Amazon Route 53 | DNS‑based routing & health checks; common for regional failover. |
  • “Must survive an AZ failure” → Multi‑AZ design.
  • “Must survive a regional outage” → Multi‑Region DR + Route 53 failover.

2️⃣ AWS Managed Services – Appropriate Use Cases

  • Managed services often include built‑in HA, scaling, and reduced operational risk.
  • Even if a service (e.g., Comprehend, Polly) isn’t an HA topic itself, the exam expects you to prefer managed services when you need higher reliability with less custom work.

3️⃣ Basic Networking Concepts

| Element | HA/FT Considerations |
| --- | --- |
| Route Tables | Correct routing is essential. |
| Public Subnets | Route to an Internet Gateway (IGW). |
| Private Subnets | Outbound traffic via NAT Gateway. |
| Multi‑AZ Designs | Each AZ needs its own subnet & routing. |
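
The routing rules above can be checked mechanically. The sketch below works on a hand-written, in-memory model of a VPC; the `Subnet` class, the subnet names, and the `igw-`/`nat-` identifiers are all hypothetical and it does not call any AWS API.

```python
from dataclasses import dataclass


@dataclass
class Subnet:
    name: str
    az: str
    kind: str           # "public" or "private"
    default_route: str  # target of 0.0.0.0/0, e.g. "igw-1" or "nat-a"


def find_routing_problems(subnets, nat_gateway_az):
    """Return human-readable problems; an empty list means the model looks sane.

    nat_gateway_az maps a (hypothetical) NAT Gateway id to the AZ it lives in.
    """
    problems = []
    for s in subnets:
        if s.kind == "public" and not s.default_route.startswith("igw-"):
            problems.append(f"{s.name}: public subnet has no route to an IGW")
        if s.kind == "private":
            nat_az = nat_gateway_az.get(s.default_route)
            if nat_az is None:
                problems.append(f"{s.name}: private subnet has no NAT Gateway route")
            elif nat_az != s.az:
                problems.append(f"{s.name}: depends on a NAT Gateway in {nat_az}")
    return problems


subnets = [
    Subnet("public-a", "az-a", "public", "igw-1"),
    Subnet("private-a", "az-a", "private", "nat-a"),
    Subnet("private-b", "az-b", "private", "nat-a"),  # cross-AZ dependency
]
print(find_routing_problems(subnets, {"nat-a": "az-a"}))
```

Here `private-b` is flagged because losing `az-a` would also cut its outbound traffic.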

4️⃣ Disaster‑Recovery (DR) Strategies

| DR Strategy | What It Is | Typical RTO / RPO | Cost |
| --- | --- | --- | --- |
| Backup & Restore | Restore from backups into a new environment. | Slow RTO, higher RPO | Lowest |
| Pilot Light | Minimal core services running (e.g., DB + minimal infra). | Medium RTO, medium RPO | Low–Medium |
| Warm Standby | Scaled‑down but fully functional stack always running. | Faster RTO, low RPO | Medium–High |
| Active‑Active | Both Regions serve traffic simultaneously. | Lowest RTO/RPO | Highest |

Tip: When RTO/RPO are strict, lean toward Warm Standby or Active‑Active.
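
To make the trade-off concrete, here is a minimal decision sketch. The minute thresholds are assumptions chosen for illustration, not AWS-published cut-offs, and taking the tighter of RTO and RPO is a deliberate simplification.

```python
def choose_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Pick the cheapest strategy that plausibly meets the tighter objective.

    All thresholds below are hypothetical; tune them to real business targets.
    """
    tightest = min(rto_minutes, rpo_minutes)
    if tightest < 1:
        return "Active-Active"
    if tightest < 15:
        return "Warm Standby"
    if tightest < 120:
        return "Pilot Light"
    return "Backup & Restore"


for rto, rpo in [(0.5, 0.5), (10, 5), (60, 60), (480, 1440)]:
    print(f"RTO {rto} min, RPO {rpo} min -> {choose_dr_strategy(rto, rpo)}")
```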

5️⃣ Distributed Design Patterns

  1. Retry with exponential back‑off and jitter – spreads retries out and avoids a thundering herd.
  2. Timeouts – prevent resource exhaustion.
  3. Circuit breaker / Bulkhead – limit cascade failures.
  4. Queue‑based load leveling – e.g., Amazon SQS.
  5. Idempotency – safe retries.
  6. Multi‑AZ deployment – for every critical tier.
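
The retry pattern from the list above can be sketched in a few lines. This is a minimal example using exponential back-off with full jitter; the function name, parameter names, and defaults are illustrative choices, and the retried operation must be idempotent for this to be safe.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait before each retry is a random value between 0 and an
    exponentially growing cap, so many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


attempts = {"count": 0}


def flaky_call():
    """Hypothetical operation that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_with_backoff(flaky_call, base_delay=0.01))
```

In real code you would usually catch only errors known to be transient (throttling, timeouts) rather than every exception.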

6️⃣ Failover Strategies

  1. Load‑balancer failover – across targets in multiple AZs (within a Region).
  2. Database failover – e.g., RDS Multi‑AZ.
  3. DNS failover – Route 53 health checks across Regions.
  4. Client‑side failover – applications try secondary endpoints.

“Fail over between Regions” → Route 53 failover routing (or latency‑based + health checks).
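
As a sketch of what Route 53 failover routing looks like in practice, the snippet below builds a change batch with a PRIMARY and a SECONDARY record. The domain, IP addresses, and health check ID are placeholders; the dictionary is only constructed here, but it follows the shape that the Route 53 `ChangeResourceRecordSets` API expects.

```python
def failover_record(name, ip, role, health_check_id=None):
    """Build one UPSERT change for a failover record set.

    role must be "PRIMARY" or "SECONDARY".
    """
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-endpoint",
        "Failover": role,
        "TTL": 60,  # a short TTL lets clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Placeholder values: replace with your own zone, endpoints and health check.
change_batch = {
    "Changes": [
        failover_record("app.example.com.", "192.0.2.10", "PRIMARY",
                        health_check_id="health-check-id-placeholder"),
        failover_record("app.example.com.", "198.51.100.20", "SECONDARY"),
    ]
}
print([c["ResourceRecordSet"]["Failover"] for c in change_batch["Changes"]])
```

With boto3 installed and credentials configured, a batch like this would be passed as `ChangeBatch` to the Route 53 client's `change_resource_record_sets` call together with a `HostedZoneId`.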

7️⃣ Immutable Deployments

  1. Build a new AMI / container image.
  2. Deploy new instances / tasks.
  3. Terminate old ones.

Benefits

  • Consistency.
  • Faster recovery.
  • Lower configuration drift.

Best practice: Combine IaC (CloudFormation, CDK, Terraform) with immutable deployments to ensure infrastructure integrity and repeatability.
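
The three steps can be simulated in memory to show the key property: running instances are replaced, never patched. Everything here (the `launch` helper, the instance dictionaries, the image names) is hypothetical and stands in for real AMI or container tooling.

```python
import itertools

_ids = itertools.count(1)


def launch(image):
    """Pretend to start an instance from an image (in-memory stand-in)."""
    return {"id": f"i-{next(_ids)}", "image": image}


def immutable_deploy(fleet, new_image, is_healthy):
    """Launch a full replacement fleet, then swap only if it is healthy."""
    replacements = [launch(new_image) for _ in fleet]
    if all(is_healthy(instance) for instance in replacements):
        return replacements  # old instances would be terminated here
    return fleet             # rollback: the old fleet was never modified


fleet = [launch("app-v1"), launch("app-v1")]
fleet = immutable_deploy(fleet, "app-v2", is_healthy=lambda instance: True)
print([instance["image"] for instance in fleet])
```

Because the old fleet is untouched until the new one passes health checks, rolling back is just a matter of keeping it.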

8️⃣ Load‑Balancing Concepts – Application Load Balancer (ALB)

  • Distributes traffic across targets in multiple AZs.
  • Eliminates single‑instance SPOFs.

9️⃣ Proxy Concepts – Amazon RDS Proxy

Helps reliability for spiky or serverless workloads by:

  1. Pooling & reusing DB connections.
  2. Reducing DB overload from connection storms.
  3. Improving failover behavior for supported patterns.

Use case: Lambda functions causing too many DB connections → RDS Proxy.
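
To illustrate why pooling helps, here is a toy connection pool. It is not how RDS Proxy is implemented; it only shows the idea that many short-lived callers can share a small, fixed set of database connections. The class and function names are made up for the example.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hand out a fixed set of connections and take them back after use."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=1.0):
        # Blocks while all connections are busy; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


opened = []


def open_connection():
    """Stand-in for an expensive database connect call."""
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(open_connection, size=5)
for _ in range(100):  # 100 simulated invocations
    with pool.connection() as conn:
        pass          # run a query with conn here
print(f"invocations: 100, connections opened: {len(opened)}")
```

Without the pool, each of the 100 invocations would open its own connection, which is the connection storm the section describes.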

🔟 Service Quotas & Throttling – Standby Environments

  • In DR scenarios, the standby Region/account must have sufficient quotas to scale up.
  • Actions:
    • Check & adjust Service Quotas.
    • Design for throttling with retries, back‑off, and buffering.
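
A quota check for the standby Region can be as simple as comparing what a failover would need against what is currently allowed. The quota names and numbers below are hypothetical inputs; in practice they would come from your capacity plan and from the Service Quotas console or API.

```python
def quota_shortfalls(required, quotas):
    """Return how far each quota falls short of what failover would need."""
    shortfalls = {}
    for name, needed in required.items():
        allowed = quotas.get(name, 0)
        if needed > allowed:
            shortfalls[name] = needed - allowed
    return shortfalls


# Hypothetical numbers for a standby Region.
required_at_failover = {"on_demand_vcpus": 256, "elastic_ips": 6, "nat_gateways": 3}
standby_quotas = {"on_demand_vcpus": 64, "elastic_ips": 5, "nat_gateways": 5}

print(quota_shortfalls(required_at_failover, standby_quotas))
```

Any non-empty result means a quota increase should be requested before a failover, not during one.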

1️⃣1️⃣ Storage Options & Characteristics – Durability & Replication

| Service | Durability / Replication |
| --- | --- |
| Amazon S3 | Highly durable, regional; supports versioning & replication. |
| Amazon EBS | Replicated within an AZ; snapshots can be stored in S3 for durability. |
| Amazon EFS | Regional, multi‑AZ within a Region. |

1️⃣2️⃣ Workload Visibility – AWS X‑Ray

  • CloudWatch – metrics & alarms for health & scaling.
  • AWS X‑Ray – tracing distributed requests, pinpointing bottlenecks.

Visibility is essential for HA: it helps you detect and diagnose failures quickly.

A️⃣ Determine Automation Strategies to Ensure Infrastructure Integrity

Look for:

  1. Infrastructure as Code – CloudFormation, CDK, Terraform.
  2. Automated deployments – blue/green, rolling.
  3. Auto Scaling + health checks.
  4. Automated recovery actions – replace unhealthy instances/tasks.

B️⃣ Common Architecture Choices

  • Multi‑AZ: ALB + Auto Scaling + Multi‑AZ database (RDS Multi‑AZ).
  • Multi‑Region: Route 53 + replicated data + standby/active environment.

“AZ outage must not cause downtime” → Multi‑AZ everything.

C️⃣ Key Operational Metrics

  1. Availability / error rate (5xx).
  2. Latency p95 / p99.
  3. Queue depth / age (SQS).
  4. CPU / memory / connections (compute & DB).
  5. RPO / RTO compliance signals (backup success, replication lag).
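
Two of these metrics are easy to compute from raw samples. The sketch below uses a nearest-rank percentile and a 5xx error rate; the sample data is invented, and in AWS you would normally read these values from CloudWatch rather than compute them by hand.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Invented samples: latencies in milliseconds and HTTP status codes.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 900, 17]
statuses = [200, 200, 200, 503, 200, 200, 200, 200, 500, 200]

print("p95 latency (ms):", percentile(latencies_ms, 95))
print("p99 latency (ms):", percentile(latencies_ms, 99))
print("5xx error rate:", error_rate(statuses))
```

Tail percentiles expose the slow outliers that an average would hide, which is why p95 and p99 are preferred for alarms.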

D️⃣ Implement Designs to Mitigate Single Points of Failure

  • Multi‑AZ deployments.
  • Redundant NAT Gateways – one per AZ (best practice).
  • Multi‑AZ databases.
  • Avoid single‑instance “pet” servers.
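
A quick way to spot single points of failure is to list which AZs each tier runs in and flag anything that lives in only one. The tier names and AZ labels below are a hypothetical inventory, not output from an AWS API.

```python
def single_points_of_failure(deployment, min_azs=2):
    """Return the tiers that run in fewer than min_azs distinct AZs."""
    return sorted(
        tier for tier, azs in deployment.items() if len(set(azs)) < min_azs
    )


# Hypothetical inventory: tier -> AZs where it has capacity.
deployment = {
    "web": ["us-east-1a", "us-east-1b"],
    "app": ["us-east-1a", "us-east-1b"],
    "nat": ["us-east-1a"],                # only one NAT Gateway
    "db": ["us-east-1a", "us-east-1a"],   # two instances, same AZ
}
print(single_points_of_failure(deployment))
```

Note that two instances in the same AZ still count as a single point of failure for an AZ outage.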

E️⃣ Ensure Durability and Availability of Data – Backups

  1. Automated backups (RDS).
  2. Snapshots (EBS, RDS).
  3. S3 versioning + replication where required.
  4. AWS Backup policies for centralized backup.
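
Whether a backup schedule meets an RPO comes down to the longest gap between recovery points. The sketch below checks that with plain timestamps; the times and RPO values are invented for the example.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Longest gap between consecutive backups, including the gap up to now."""
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times, now, rpo):
    """True if a failure at the worst moment would lose at most rpo of data."""
    if not backup_times:
        return False  # no recovery point at all
    return worst_case_data_loss(backup_times, now) <= rpo


# Invented schedule: backups at 00:00 and 06:00, checked at 12:00.
now = datetime(2026, 2, 5, 12, 0)
backups = [datetime(2026, 2, 5, 0, 0), datetime(2026, 2, 5, 6, 0)]

print("worst-case loss:", worst_case_data_loss(backups, now))
print("meets 4 h RPO:", meets_rpo(backups, now, timedelta(hours=4)))
print("meets 8 h RPO:", meets_rpo(backups, now, timedelta(hours=8)))
```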

F️⃣ Select a DR Strategy Based on RTO/RPO

| Strategy | Cost | RTO / RPO |
| --- | --- | --- |
| Backup/Restore | Low | Slow / Higher |
| Pilot Light | Low‑Medium | Medium |
| Warm Standby | Medium‑High | Faster / Low |
| Active‑Active | Highest | Fastest / Lowest |

G️⃣ Improve Reliability of Legacy Applications

When you cannot change the application code, use infrastructure patterns:

  1. Place the app behind an ALB.
  2. Use Auto Scaling groups to replace failed instances automatically.
  3. Deploy RDS Proxy to stabilize DB connections.
  4. Add caching (e.g., ElastiCache) to reduce backend load.
  5. Configure DNS failover (Route 53) for regional resilience.

Disaster Recovery (DR) – Managed Services to Reduce Failure Modes

  • Application Layer: ALB, Auto Scaling, Route 53
  • Data Layer: RDS Multi‑AZ, DynamoDB (managed HA)
  • Messaging: SQS / SNS for decoupling spikes and failures
  • Edge: CloudFront for edge caching and origin protection

Requirements & Directions

| Scenario | Recommended AWS Services |
| --- | --- |
| Survive an instance failure | Auto Scaling + health checks + ALB |
| Survive an AZ failure | Multi‑AZ for each tier (ALB targets across AZs, Multi‑AZ DB) |
| Survive a Region failure | DR strategy + Route 53 failover + replicated data |
| Strict RTO / RPO | Warm standby or active‑active |
| Lambda overwhelms RDS with connections | RDS Proxy |
| Need to see bottlenecks across microservices | X‑Ray (plus CloudWatch) |
| Standby must scale during failover | Plan Service Quotas + scaling policies |

Checklist

  • Every critical tier is deployed across multiple AZs
  • Traffic is distributed via ALB/NLB and unhealthy targets are replaced automatically
  • Databases use HA features (e.g., RDS Multi‑AZ or managed HA services)
  • DR strategy matches business RTO/RPO (backup/restore vs pilot‑light vs warm standby vs active‑active)
  • Regional failover uses Route 53 health checks/routing (when required)
  • Data durability is addressed (backups, snapshots, replication)
  • Quotas and throttling are considered for failover/standby scaling
  • Monitoring and tracing exist (CloudWatch + X‑Ray)

Primary AWS Documentation (for reference)

  • Route 53 – DNS routing & health checks
  • Disaster Recovery on AWS – design patterns & best practices
  • VPC Route Tables – network traffic flow
  • Application Load Balancer – layer‑7 load balancing
  • Auto Scaling (EC2) – dynamic instance scaling
  • RDS Multi‑AZ – high‑availability database deployments
  • RDS Proxy – connection pooling for serverless workloads
  • Service Quotas – limits & scaling considerations
  • S3 Replication – cross‑region object durability
  • EBS Snapshots – point‑in‑time volume backups
  • EFS Overview – shared file storage across AZs
  • AWS X‑Ray – distributed tracing
  • CloudWatch – metrics, logs, alarms
  • Amazon Comprehend – natural‑language processing (optional)
  • Amazon Polly – text‑to‑speech (optional)