AWS re:Invent 2025 - Global Resilient Apps: Guide to Multi-AZ/Region Architecture with ELB (NET311)
Source: Dev.to
Introduction
In this session, Jon Zobrist (Head of Customer Success, ELB) and Felipe da Silva (Principal Solutions Architect, ELB) discuss strategies for building resilient applications using AWS Elastic Load Balancing (ELB). They cover multi‑AZ and multi‑region resiliency, DNS‑based failover, health‑check configurations, and best practices for deployment and change management.
“Everything fails, all the time.” – Werner Vogels
Resilience is defined by the AWS Resilience Hub as the ability of a workload to recover from infrastructure or service disruptions. The discussion focuses on technical drivers (availability, latency, downtime) and business drivers (revenue continuity, customer trust).
Why Resilience Matters
Common failure sources
- Configuration and deployment errors – the leading cause of outages.
- Infrastructure issues – racks, servers, power, cooling.
- Data problems – corruption or incomplete replication.
- Extreme events – earthquakes, floods, tsunamis (rare but possible).
Resilience vs. Disaster Recovery
- Disaster Recovery emphasizes backup and restore processes, often with recovery times measured in hours or days.
- High Availability (HA) keeps primary and secondary sites live, enabling rapid failover (seconds to minutes) and supporting active‑active architectures.
Shared responsibility – AWS provides the global infrastructure (multiple regions and Availability Zones), while customers design, configure, and operate resilient workloads.
Core AWS Primitives
- Regions – geographically isolated collections of data centers.
- Availability Zones (AZs) – independent infrastructure within a region (separate buildings, power, networking) with low‑latency connectivity (single‑digit ms).
These primitives enable you to distribute resources for fault isolation and rapid recovery.
Multi‑AZ Resiliency
Typical Architecture
- Multiple EC2 instances spread across AZs.
- A primary database (often in a single AZ) with an optional secondary replica.
Failure Scenarios
- Database failure – requires a full database failover, causing a temporary dip in availability.
- Front‑door (ELB) degradation – a subset of hosts in one AZ cannot reach the database while hosts in the other AZs remain healthy. The application stays up, but affected users see degraded performance.
ELB Benefits
- Transparent scaling – automatically adds or removes capacity.
- Health checks – ELB continuously probes targets and removes unhealthy ones from rotation (see the sketch after this list).
- DNS‑based routing – updates DNS records to point to healthy AZs, reducing the impact of localized failures.
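The health-check behavior described above is configured per target group. Below is a minimal boto3 sketch, not taken from the session, showing one way to set the probe path, interval, and healthy/unhealthy thresholds; the target group name, VPC ID, and threshold values are illustrative placeholders.

```python
# Minimal sketch: configure ELB health checks on a target group with boto3.
# The name, VPC ID, ports, and thresholds are placeholders, not session values.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Targets are probed on /healthz every 10 seconds. A target leaves rotation
# after 2 consecutive failed checks and returns after 3 consecutive successes.
response = elbv2.create_target_group(
    Name="web-tg",                      # placeholder name
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",      # placeholder VPC
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/healthz",
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=2,
    Matcher={"HttpCode": "200"},
)
print(response["TargetGroups"][0]["TargetGroupArn"])
```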
Design Recommendations
| Recommendation | Description |
|---|---|
| Pre‑provision capacity | Provision enough capacity in each AZ to absorb the loss of at least one AZ without degrading performance. |
| Cross‑zone load balancing | Enable to distribute traffic evenly across AZs, but be aware of the additional inter‑AZ data transfer cost. |
| Configurable health thresholds | Set target‑group health thresholds (e.g., fail a zone out of DNS when fewer than 50% of its targets are healthy) so failover is triggered before a total outage (see the sketch after this table). |
| Route 53 health checks | Combine ELB health checks with Route 53 health checks for DNS‑level failover. |
| Application Recovery Controller | Use for zonal shifts and automated failover orchestration. |
| Honor DNS TTLs | Clients should respect the TTL (typically 60 seconds) to receive updated DNS records quickly. |
| Connection pooling | Reduce the impact of DNS changes by reusing existing connections where possible. |
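Two of the table items can be expressed as ELB attribute changes. The sketch below, with placeholder ARNs and illustrative threshold values, sets target‑group health (DNS failover) thresholds and enables cross‑zone load balancing on a Network Load Balancer; the attribute keys are the ones documented for ELB target group health and load balancer attributes, so verify them against the current ELB documentation before relying on them.

```python
# Illustrative sketch (placeholder ARNs, assumed values): configurable health
# thresholds plus cross-zone load balancing from the table above.
import boto3

elbv2 = boto3.client("elbv2")

TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/web-tg/..."          # placeholder
LB_ARN = "arn:aws:elasticloadbalancing:...:loadbalancer/net/web-nlb/..."    # placeholder

# Fail the zone out of DNS once fewer than 50% of its targets are healthy,
# and change routing behavior once fewer than 2 targets are healthy.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TG_ARN,
    Attributes=[
        {"Key": "target_group_health.dns_failover.minimum_healthy_targets.percentage",
         "Value": "50"},
        {"Key": "target_group_health.unhealthy_state_routing.minimum_healthy_targets.count",
         "Value": "2"},
    ],
)

# Enable cross-zone load balancing on a Network Load Balancer (ALBs have it
# enabled by default at the load-balancer level).
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=LB_ARN,
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "true"}],
)
```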
Multi‑Region Resilience
Strategies
- Route 53 failover records – the primary region serves traffic; the secondary region takes over when health checks fail (see the sketch after this list).
- Weighted routing – split traffic across regions for load balancing and gradual migration.
- DNS load shedding – limit DNS responses during large‑scale incidents to avoid congestive collapse.
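A minimal sketch of the failover‑record pattern, using placeholder hosted zone IDs and load balancer DNS names: the PRIMARY alias record answers queries while Route 53 evaluates its target (the primary region's load balancer) as healthy, and queries fall back to the SECONDARY record otherwise.

```python
# Minimal sketch (placeholder IDs and hostnames) of Route 53 failover records
# pointing at load balancers in two regions.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # placeholder public hosted zone

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "AliasTarget": {
                        # Canonical hosted zone ID of the primary load balancer (placeholder).
                        "HostedZoneId": "ZELBEXAMPLE11111",
                        "DNSName": "primary-alb-123.us-east-1.elb.amazonaws.com",
                        "EvaluateTargetHealth": True,
                    },
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "AliasTarget": {
                        # Canonical hosted zone ID of the standby load balancer (placeholder).
                        "HostedZoneId": "ZELBEXAMPLE22222",
                        "DNSName": "standby-alb-456.us-west-2.elb.amazonaws.com",
                        "EvaluateTargetHealth": True,
                    },
                },
            },
        ]
    },
)
```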
Deployment Practices
- Zonal rollouts – deploy new versions to one AZ at a time, validate health, then expand (a zonal‑shift sketch follows this list).
- Graceful degradation – design services to continue operating with reduced functionality when downstream dependencies are unavailable.
- Testing & change management – simulate failures (e.g., chaos engineering) to verify that routing and failover behave as expected.
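If a zonal rollout (or any zonal impairment) starts degrading one AZ, a zonal shift via Application Recovery Controller, mentioned in the table above, moves traffic away from that zone while you investigate or roll back. A minimal sketch, assuming a placeholder load balancer ARN and AZ ID:

```python
# Minimal sketch (placeholder ARN and AZ ID): start an ARC zonal shift to move
# traffic away from an AZ, e.g. after a zonal rollout shows elevated errors.
import boto3

arc = boto3.client("arc-zonal-shift")

arc.start_zonal_shift(
    resourceIdentifier="arn:aws:elasticloadbalancing:...:loadbalancer/net/web-nlb/...",  # placeholder
    awayFrom="use1-az1",     # Availability Zone ID to shift traffic away from
    expiresIn="30m",         # shift expires automatically; extend or cancel as needed
    comment="Shifting away from use1-az1 during rollout validation",
)
```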
Client‑Side Best Practices
- Respect DNS TTLs – ensures rapid adoption of new endpoint addresses.
- Implement connection pooling – reduces the number of new connections needed after a DNS change (see the sketch below).
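A minimal client‑side sketch using the Python `requests` library with a placeholder endpoint: requests sent through one Session reuse pooled connections, and only new connections trigger a fresh DNS lookup, which is how honoring the TTL and connection pooling work together.

```python
# Minimal sketch (placeholder URL): client-side connection pooling with requests.
# Requests through one Session reuse pooled TCP/TLS connections; brand-new
# connections re-resolve DNS, so an honored 60-second TTL eventually steers
# clients to updated records without a flood of new connections.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)  # pool sizing is illustrative
session.mount("https://", adapter)

for _ in range(100):
    # Reuses an existing connection when one is idle in the pool.
    resp = session.get("https://app.example.com/healthz", timeout=2)
    resp.raise_for_status()
```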
Operational Takeaways
- Configuration and deployment changes are the top cause of outages – invest in automated testing, canary releases, and robust change‑management processes.
- Health‑check tuning is critical – balance sensitivity (fast detection) against flapping (false positives).
- Monitoring and observability – track ELB metrics, Route 53 health status, and application latency to detect issues early (see the metric‑polling sketch after this list).
- Regular disaster‑recovery drills – validate both multi‑AZ and multi‑region failover paths.
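As one example of the monitoring point above, the sketch below polls an ELB CloudWatch metric to watch unhealthy‑host counts; the load balancer and target group dimension values are placeholders.

```python
# Minimal sketch (placeholder dimensions): poll the UnHealthyHostCount metric
# published by an Application Load Balancer to spot trouble early.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.datetime.now(datetime.timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/web-alb/0123456789abcdef"},       # placeholder
        {"Name": "TargetGroup", "Value": "targetgroup/web-tg/0123456789abcdef"},  # placeholder
    ],
    StartTime=now - datetime.timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```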
Closing
Jon and Felipe emphasize that resilience is an ongoing practice. By leveraging ELB’s health checks, Route 53 failover capabilities, and thoughtful architecture across AZs and regions, you can build applications that stay available even when components fail.
For further details, explore the AWS Resilience Hub and the ELB documentation.