Design Highly Available And / Or Fault-Tolerant Architectures
Source: Dev.to
Exam Guide – Solutions Architect – Associate
Domain 2: Design Resilient Architectures
Task Statement 2.2 – Designing Highly Available and Fault‑Tolerant Architectures
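High Availability (HA) – the system recovers from component failures quickly, so downtime is minimal but a brief interruption is possible.
Fault Tolerance (FT) – the system keeps operating through component failures with no interruption, usually at higher cost because redundant capacity is always active.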
Typical HA pattern – Multi‑AZ + load balancing + managed services + no single points of failure.
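The effect of removing single points of failure can be estimated with standard availability arithmetic: components in series multiply their availabilities, while redundant components in parallel are down only when all copies are down. The sketch below is illustrative only. The 99.9% per-instance figure is an assumption, not an AWS SLA value, and the parallel formula assumes independent failures.

```python
from math import prod

def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)

def parallel(availability: float, copies: int) -> float:
    """Availability of redundant copies where one working copy is enough.

    Assumes failures are independent, which real AZs only approximate.
    """
    return 1 - (1 - availability) ** copies

# Illustrative number only (not an AWS SLA figure).
single_instance = 0.999

web_tier = parallel(single_instance, copies=2)  # instances in two AZs
db_tier = parallel(single_instance, copies=2)   # primary plus standby

print(f"one instance per tier: {serial(single_instance, single_instance):.6f}")
print(f"two copies per tier:   {serial(web_tier, db_tier):.6f}")
```

Under these assumptions, duplicating each tier across two AZs improves the end-to-end figure far more than hardening any single instance would.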
1️⃣ AWS Global Infrastructure
| Component | Description |
|---|---|
| Availability Zones (AZs) | Isolated failure domains within a Region. |
| Regions | Separate geographic areas – used for disaster‑recovery (DR). |
| Amazon Route 53 | DNS‑based routing & health checks; common for regional failover. |
- “Must survive an AZ failure” → Multi‑AZ design.
- “Must survive a regional outage” → Multi‑Region DR + Route 53 failover.
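As a rough sketch of what "Route 53 failover" means in practice, the snippet below builds a primary/secondary pair of failover records in the shape the Route 53 `ChangeResourceRecordSets` API expects. The domain name, IP addresses, set identifiers, and health check ID are placeholders invented for illustration. Nothing is sent to AWS here; with boto3, a dict like this would typically be passed as `ChangeBatch` to `change_resource_record_sets`.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one failover ResourceRecordSet in the Route 53 API shape."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,   # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,                # short TTL so clients re-resolve quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record

# All names, addresses, and IDs below are hypothetical placeholders.
change_batch = {
    "Comment": "Regional failover sketch",
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": failover_record(
                "app.example.com.", "192.0.2.10", "PRIMARY",
                "primary-region",
                health_check_id="hypothetical-health-check-id",
            ),
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": failover_record(
                "app.example.com.", "198.51.100.10", "SECONDARY",
                "standby-region",
            ),
        },
    ],
}

print(change_batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

The primary record carries the health check: when that check fails, Route 53 starts answering with the secondary record.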
2️⃣ AWS Managed Services – Appropriate Use Cases
- Managed services often include built‑in HA, scaling, and reduced operational risk.
- Even if a service (e.g., Comprehend, Polly) isn’t an HA topic itself, the exam expects you to prefer managed services when you need higher reliability with less custom work.
3️⃣ Basic Networking Concepts
| Element | HA/FT Considerations |
|---|---|
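| Route Tables | Every subnet is associated with a route table; a missing or wrong route cuts off an otherwise healthy tier. |
| Public Subnets | Default route (0.0.0.0/0) points to an Internet Gateway (IGW). |
| Private Subnets | Outbound internet traffic goes through a NAT Gateway placed in a public subnet. |
| Multi‑AZ Designs | A subnet lives in exactly one AZ, so each AZ needs its own subnets and routing. |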
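Because a subnet belongs to a single AZ, a Multi-AZ VPC needs at least one public and one private subnet per AZ. A minimal sketch of such an address plan, using only the Python standard library, is shown below. The VPC CIDR, AZ names, and the /24 subnet size are assumptions chosen for illustration.

```python
import ipaddress

def plan_subnets(vpc_cidr: str, azs: list[str], new_prefix: int = 24) -> dict:
    """Assign one public and one private subnet to each AZ."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for az in azs:
        plan[az] = {
            # Public route table: 0.0.0.0/0 -> Internet Gateway
            "public": str(next(blocks)),
            # Private route table: 0.0.0.0/0 -> NAT Gateway in the same AZ
            "private": str(next(blocks)),
        }
    return plan

# Hypothetical VPC range and AZ names.
for az, subnets in plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"]).items():
    print(az, subnets)
```

Keeping each private subnet's default route on a NAT Gateway in its own AZ means an outage in one AZ does not take away outbound access for the others.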
4️⃣ Disaster‑Recovery (DR) Strategies
| DR Strategy | What It Is | Typical RTO / RPO | Cost |
|---|---|---|---|
| Backup & Restore | Restore from backups into a new environment. | Slow RTO, higher RPO | Lowest |
| Pilot Light | Minimal core services running (e.g., DB + minimal infra). | Medium RTO, medium RPO | Low–Medium |
| Warm Standby | Scaled‑down but fully functional stack always running. | Faster RTO, low RPO | Medium–High |
| Active‑Active | Both Regions serve traffic simultaneously. | Lowest RTO/RPO | Highest |
Tip: When RTO/RPO are strict, lean toward Warm Standby or Active‑Active.
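The selection logic behind the table can be sketched as "pick the cheapest strategy that still meets both targets". The minute values attached to each strategy below are placeholders invented for illustration; the table above gives only relative rankings, and real figures depend on the workload.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed achievable recovery time
    rpo_minutes: float  # assumed achievable data-loss window

# Ordered cheapest first, as in the table. Numbers are illustrative only.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=1),
    Strategy("Active-Active", rto_minutes=1, rpo_minutes=0),
]

def cheapest_strategy(required_rto: float, required_rpo: float):
    """Return the first (cheapest) strategy meeting both targets, else None."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= required_rto
                and strategy.rpo_minutes <= required_rpo):
            return strategy.name
    return None

print(cheapest_strategy(required_rto=120, required_rpo=30))
print(cheapest_strategy(required_rto=15, required_rpo=5))
```

On the exam the reasoning is the same, only qualitative: relaxed targets point to Backup & Restore, strict targets to Warm Standby or Active-Active.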
5️⃣ Distributed Design Patterns
- Retry with exponential back‑off and jitter – spreads retries out so clients do not hit a recovering service all at once (thundering herd).
- Timeouts – prevent resource exhaustion.
- Circuit breaker / Bulkhead – limit cascade failures.
- Queue‑based load leveling – e.g., Amazon SQS.
- Idempotency – safe retries.
- Multi‑AZ deployment – for every critical tier.
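The retry pattern from the list above can be sketched as capped exponential back-off with full jitter: each wait is a random value between zero and a ceiling that doubles per attempt. `TransientError`, the parameter names, and the default values are hypothetical choices for this example, and the operation being retried is assumed to be idempotent.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (throttling, timeouts)."""

def call_with_retries(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError with jittered back-off."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: wait in [0, ceiling)

# Usage: an operation that is throttled twice before succeeding.
attempts = {"count": 0}

def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated throttling")
    return "ok"

print(call_with_retries(flaky_operation, base_delay=0.01))
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. Retrying is safe here only because the operation is assumed idempotent, which is why idempotency appears in the same list.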
6️⃣ Failover Strategies
- Load‑balancer failover – across targets in multiple AZs (within a Region).
- Database failover – e.g., RDS Multi‑AZ.
- DNS failover – Route 53 health checks across Regions.
- Client‑side failover – applications try secondary endpoints.
“Fail over between Regions” → Route 53 failover routing (or latency‑based + health checks).
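Client-side failover, the last item in the list above, can be as simple as trying an ordered list of endpoints. The sketch below uses hypothetical endpoint URLs and a stand-in `fake_call` function in place of a real HTTP client, so it runs without network access.

```python
def first_healthy(endpoints, call):
    """Try each endpoint in order; return (endpoint, result) for the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, call(endpoint)
        except ConnectionError as exc:
            errors.append((endpoint, str(exc)))
    raise ConnectionError(f"all endpoints failed: {errors}")

# Hypothetical endpoints, listed in order of preference.
ENDPOINTS = [
    "https://api.primary.example.com",
    "https://api.secondary.example.com",
]

def fake_call(endpoint):
    """Stand-in for a real request; the primary is simulated as down."""
    if "primary" in endpoint:
        raise ConnectionError("primary unreachable")
    return {"status": 200}

used, response = first_healthy(ENDPOINTS, fake_call)
print(used, response)
```

This moves failover logic into every client, which is why DNS or load-balancer failover is usually preferred when it is available.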
7️⃣ Immutable Deployments
- Build a new AMI / container image.
- Deploy new instances / tasks.
- Terminate old ones.
Benefits
- Consistency.
- Faster recovery.
- Lower configuration drift.
Best practice: Combine IaC (CloudFormation, CDK, Terraform) with immutable deployments to ensure infrastructure integrity and repeatability.
8️⃣ Load‑Balancing Concepts – Application Load Balancer (ALB)
- Distributes traffic across targets in multiple AZs.
- Eliminates single‑instance SPOFs.
9️⃣ Proxy Concepts – Amazon RDS Proxy
Helps reliability for spiky or serverless workloads by:
- Pooling & reusing DB connections.
- Reducing DB overload from connection storms.
- Shortening failover, because the proxy holds client connections open and routes them to the new primary instance.
Use case: Lambda functions causing too many DB connections → RDS Proxy.
🔟 Service Quotas & Throttling – Standby Environments
- In DR scenarios, the standby Region/account must have sufficient quotas to scale up.
- Actions:
  - Check & adjust Service Quotas.
  - Design for throttling with retries, back‑off, and buffering.
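A quota check for the standby Region comes down to comparing remaining headroom with what a failover would demand. The sketch below uses made-up quota labels and numbers; in practice the limits would come from the Service Quotas console or API, and the names here are not real quota codes.

```python
def quota_shortfalls(quotas: dict, current_usage: dict, failover_demand: dict) -> dict:
    """Return, per quota, how many units short the standby would be."""
    shortfalls = {}
    for name, needed in failover_demand.items():
        available = quotas.get(name, 0) - current_usage.get(name, 0)
        if needed > available:
            shortfalls[name] = needed - available
    return shortfalls

# Hypothetical labels and values for a standby Region.
standby_quotas = {"on_demand_vcpus": 64, "elastic_ips": 5}
standby_usage = {"on_demand_vcpus": 8, "elastic_ips": 1}
needed_on_failover = {"on_demand_vcpus": 96, "elastic_ips": 3}

print(quota_shortfalls(standby_quotas, standby_usage, needed_on_failover))
```

Any non-empty result means a quota increase should be requested before the failover is ever needed, since increases are not guaranteed to be instant.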
1️⃣1️⃣ Storage Options & Characteristics – Durability & Replication
| Service | Durability / Replication |
|---|---|
| Amazon S3 | Highly durable, regional; supports versioning & replication. |
| Amazon EBS | Replicated within an AZ; snapshots can be stored in S3 for durability. |
| Amazon EFS | Regional file systems store data redundantly across multiple AZs; One Zone file systems keep it in a single AZ. |
1️⃣2️⃣ Workload Visibility – AWS X‑Ray
- CloudWatch – metrics & alarms for health & scaling.
- AWS X‑Ray – tracing distributed requests, pinpointing bottlenecks.
Visibility is essential for HA: it helps you detect and diagnose failures quickly.
A️⃣ Determine Automation Strategies to Ensure Infrastructure Integrity
Look for:
- Infrastructure as Code – CloudFormation, CDK, Terraform.
- Automated deployments – blue/green, rolling.
- Auto Scaling + health checks.
- Automated recovery actions – replace unhealthy instances/tasks.
B️⃣ Common Architecture Choices
- Multi‑AZ: ALB + Auto Scaling + Multi‑AZ database (RDS Multi‑AZ).
- Multi‑Region: Route 53 + replicated data + standby/active environment.
“AZ outage must not cause downtime” → Multi‑AZ everything.
C️⃣ Key Operational Metrics
- Availability / error rate (5xx).
- Latency p95 / p99.
- Queue depth / age (SQS).
- CPU / memory / connections (compute & DB).
- RPO / RTO compliance signals (backup success, replication lag).
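To make the first two metrics concrete, the sketch below computes p95/p99 latency and the 5xx error rate from raw samples using the Python standard library. CloudWatch calculates such statistics itself; this only illustrates the definitions, and the sample data is synthetic.

```python
import statistics

def summarize(latencies_ms, status_codes):
    """Compute tail latency and 5xx error rate from raw request samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    error_rate = errors / len(status_codes)
    return {
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": error_rate,
        "availability": 1 - error_rate,
    }

# Synthetic samples: latencies of 1..101 ms and two 5xx responses out of 100.
sample_latencies = list(range(1, 102))
sample_codes = [200] * 98 + [500, 503]

print(summarize(sample_latencies, sample_codes))
```

Tail percentiles matter more than averages for HA, because a small share of slow or failing requests is often the first visible sign of a degraded AZ or dependency.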
D️⃣ Implement Designs to Mitigate Single Points of Failure
- Multi‑AZ deployments.
- Redundant NAT Gateways – one per AZ (best practice).
- Multi‑AZ databases.
- Avoid single‑instance “pet” servers.
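A quick way to review a design for single points of failure is to list the AZs each tier runs in and flag any tier confined to one AZ. The tier names and AZ placements below are a hypothetical architecture used only to illustrate the check.

```python
def single_points_of_failure(tiers: dict[str, list[str]]) -> list[str]:
    """Return the tiers whose resources all sit in fewer than two AZs."""
    return sorted(name for name, azs in tiers.items() if len(set(azs)) < 2)

# Hypothetical architecture: tier name -> AZ of each resource in that tier.
architecture = {
    "web": ["us-east-1a", "us-east-1b"],
    "database": ["us-east-1a", "us-east-1b"],  # e.g. primary plus standby
    "nat_gateway": ["us-east-1a"],             # only one NAT Gateway
    "batch_server": ["us-east-1b", "us-east-1b"],
}

print(single_points_of_failure(architecture))
```

In this example the lone NAT Gateway and the batch servers packed into one AZ are flagged, matching the "one NAT Gateway per AZ" and "no pet servers" bullets above.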
E️⃣ Ensure Durability and Availability of Data – Backups
- Automated backups (RDS).
- Snapshots (EBS, RDS).
- S3 versioning + replication where required.
- AWS Backup policies for centralized backup.
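Whether a backup schedule actually meets an RPO can be checked by comparing the age of the last successful backup with the RPO. This is a minimal sketch; the timestamps and the four-hour RPO are invented for the example, and real values would come from backup job records.

```python
from datetime import datetime, timedelta, timezone

def rpo_at_risk(last_backup: datetime, rpo: timedelta, now: datetime) -> bool:
    """True if a failure at `now` would lose more data than the RPO allows."""
    return now - last_backup > rpo

# Hypothetical timestamps and RPO.
rpo = timedelta(hours=4)
last_success = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)

print(rpo_at_risk(last_success, rpo, now=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)))
print(rpo_at_risk(last_success, rpo, now=datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)))
```

A failed or skipped backup job shows up as a growing backup age, so alarming on this value catches RPO violations before a disaster does.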
F️⃣ Select a DR Strategy Based on RTO/RPO
| Strategy | Cost | RTO / RPO |
|---|---|---|
| Backup/Restore | Low | Slow / Higher |
| Pilot Light | Low‑Medium | Medium |
| Warm Standby | Medium‑High | Faster / Low |
| Active‑Active | Highest | Fastest / Lowest |
G️⃣ Improve Reliability of Legacy Applications
When you cannot change the application code, use infrastructure patterns:
- Place the app behind an ALB.
- Use Auto Scaling groups to replace failed instances automatically.
- Deploy RDS Proxy to stabilize DB connections.
- Add caching (e.g., ElastiCache) to reduce backend load.
- Configure DNS failover (Route 53) for regional resilience.
H️⃣ Use Managed Services to Reduce Failure Modes
- Application Layer: ALB, Auto Scaling, Route 53
- Data Layer: RDS Multi‑AZ, DynamoDB (managed HA)
- Messaging: SQS / SNS for decoupling spikes and failures
- Edge: CloudFront for edge caching and origin protection
Requirements & Directions
| Scenario | Recommended AWS Services |
|---|---|
| Survive an instance failure | Auto Scaling + health checks + ALB |
| Survive an AZ failure | Multi‑AZ for each tier (ALB targets across AZs, Multi‑AZ DB) |
| Survive a Region failure | DR strategy + Route 53 failover + replicated data |
| Strict RTO / RPO | Warm standby or active‑active |
| Lambda overwhelms RDS with connections | RDS Proxy |
| Need to see bottlenecks across microservices | X‑Ray (plus CloudWatch) |
| Standby must scale during failover | Plan Service Quotas + scaling policies |
Checklist
- Every critical tier is deployed across multiple AZs
- Traffic is distributed via ALB/NLB and unhealthy targets are replaced automatically
- Databases use HA features (e.g., RDS Multi‑AZ or managed HA services)
- DR strategy matches business RTO/RPO (backup/restore vs pilot‑light vs warm standby vs active‑active)
- Regional failover uses Route 53 health checks/routing (when required)
- Data durability is addressed (backups, snapshots, replication)
- Quotas and throttling are considered for failover/standby scaling
- Monitoring and tracing exist (CloudWatch + X‑Ray)