What the AWS us-east-1 Outage Taught Me About Building Resilient Systems
AWS us‑east-1 will go down again. When it does, will your system survive?
This past weekend I built a system designed to survive such an outage.
After eight years building subscription infrastructure at Surfline—processing payments through Stripe, Apple, and Google Play—I’ve learned that the real question isn’t if your cloud provider will fail, but how your architecture degrades when it does.
I spent four hours implementing three reliability patterns drawn from the AWS Builders’ Library, Google SRE practices, and Stripe’s engineering blog. Below are the key takeaways.
Why Resilience Matters
When AWS experiences an incident, common failure modes include:
- Lambda functions timing out
- DynamoDB calls failing or throttling
- SQS queues backing up
For most applications, users simply see an error page and retry later. Payment systems are different:
- A failed charge might actually have succeeded.
- A retry could double‑charge a customer.
- A thundering herd of retries can turn a partial outage into a cascading failure.
Therefore we need patterns that handle partial failures without losing money or trust.
Pattern 1: Retry with Jitter
The AWS Builders’ Library article on Timeouts, retries, and backoff with jitter changed how I think about retry logic. Without jitter, all clients retry at the exact same intervals, creating synchronized waves that hammer a recovering service.
// Full jitter formula from AWS Builders' Library
const INITIAL_DELAY = 100;  // ms – example base delay, tune for your workload
const MAX_DELAY = 20_000;   // ms – example cap so retries never wait forever

const calculateDelay = (attempt: number): number => {
  const exponentialDelay = Math.min(
    MAX_DELAY,
    INITIAL_DELAY * Math.pow(2, attempt)
  );
  // Full jitter: random value between 0 and the exponential delay
  return Math.random() * exponentialDelay;
};
Result: In load tests the success rate jumped from roughly 70% to over 99%, because jitter spreads retry load evenly across time instead of producing synchronized spikes.
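To show where that delay function plugs in, here’s a minimal retry wrapper sketch; sendPayment and the MAX_RETRIES cap are illustrative placeholders, not part of the original implementation.
// Hypothetical wrapper around calculateDelay – retries an operation with full jitter
const MAX_RETRIES = 5; // example cap; tune for your workload
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(operation: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Sleep a random, exponentially growing amount of time before retrying
      await sleep(calculateDelay(attempt));
    }
  }
  throw lastError;
}

// Usage: await withRetry(() => sendPayment(request));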
Where It Helps
- Lambda retrying DynamoDB during throttling
- ECS tasks calling external APIs through a NAT gateway
- Step Functions with retry policies on service integrations
Pattern 2: Bounded Queue + Worker Pool
A bounded queue alone does not limit concurrent processing. In a test I set a queue capacity of 100, sent 200 requests, and expected ~100 rejections—yet got zero because Node.js processed requests faster than they accumulated.
// What you actually need: queue + worker pool
interface Request { id: string } // minimal request shape for the sketch

class BoundedQueue {
  private queue: Request[] = [];
  private readonly capacity = 100;

  enqueue(request: Request): boolean {
    if (this.queue.length >= this.capacity) {
      return false; // HTTP 429 – fail fast
    }
    this.queue.push(request);
    return true;
  }

  dequeue(): Request | undefined {
    return this.queue.shift();
  }
}

class WorkerPool {
  private activeWorkers = 0;
  private readonly maxWorkers = 10; // THIS controls throughput

  constructor(private readonly handle: (r: Request) => Promise<void>) {}

  async process(queue: BoundedQueue): Promise<void> {
    // Start handlers until the concurrency cap is hit or the queue drains
    while (this.activeWorkers < this.maxWorkers) {
      const request = queue.dequeue();
      if (!request) break;
      this.activeWorkers++;
      this.handle(request).finally(() => {
        this.activeWorkers--;
        void this.process(queue); // pick up more work when a slot frees
      });
    }
  }
}
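A quick usage sketch, with handlePayment standing in for the real payment call (it isn’t part of the original code): requests that don’t fit in the queue are rejected with a 429 right away, while the pool caps how many are processed at once.
// Hypothetical wiring of the two pieces – handlePayment is a placeholder
const handlePayment = async (request: Request): Promise<void> => {
  // call Stripe / Apple / Google Play here
};

const queue = new BoundedQueue();
const pool = new WorkerPool(handlePayment);

function handleIncoming(request: Request): number {
  if (!queue.enqueue(request)) {
    return 429; // shed load instead of buffering without bound
  }
  void pool.process(queue); // drain with at most maxWorkers in flight
  return 202; // accepted for asynchronous processing
}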
Pattern 3: Idempotency Handling
A payment request has to be safe to submit twice, because retries are exactly what the previous two patterns produce. Here is an in-memory handler that illustrates the idea:
class ConflictError extends Error {}

class IdempotencyHandler {
  private inFlight = new Set<string>();
  private cache = new Map<string, { response: any; ttl: number }>();

  async process(idempotencyKey: string, operation: () => Promise<any>) {
    // Return the cached response if the key was already processed and hasn't expired
    const cached = this.cache.get(idempotencyKey);
    if (cached && cached.ttl > Date.now()) return cached.response;

    // Detect concurrent duplicates
    if (this.inFlight.has(idempotencyKey)) {
      throw new ConflictError('Request already in progress');
    }

    this.inFlight.add(idempotencyKey);
    try {
      const response = await operation();
      // Only cache successes
      if (response.success) {
        this.cache.set(idempotencyKey, {
          response,
          ttl: Date.now() + 24 * 60 * 60 * 1000, // keep for 24 h
        });
      }
      return response;
    } finally {
      this.inFlight.delete(idempotencyKey);
    }
  }
}
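A usage sketch; chargeCustomer and the key format are illustrative rather than from the original code. The point is that a client retrying the same order reuses the same key, so it gets the first result back instead of a second charge.
// Hypothetical caller – chargeCustomer stands in for the real provider call
declare function chargeCustomer(
  orderId: string,
  amountCents: number
): Promise<{ success: boolean }>;

const idempotencyHandler = new IdempotencyHandler();

async function charge(orderId: string, amountCents: number) {
  // Every retry of the same order derives the same key
  return idempotencyHandler.process(`charge-${orderId}`, () =>
    chargeCustomer(orderId, amountCents)
  );
}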
Storage Options
- DynamoDB: Conditional writes with TTL for automatic cleanup
- Lambda Powertools: Built‑in idempotency utility using DynamoDB
- Step Functions: Native idempotency via execution names
// DynamoDB idempotency pattern
await dynamodb.put({
  TableName: 'IdempotencyStore',
  Item: {
    idempotencyKey: key,
    response: result,
    ttl: Math.floor(Date.now() / 1000) + 86400, // 24 h
  },
  ConditionExpression: 'attribute_not_exists(idempotencyKey)',
});
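One detail the snippet above doesn’t show: when the condition fails, DynamoDB rejects the write with a ConditionalCheckFailedException, which is exactly the duplicate you want to catch. Here is a sketch that reuses the dynamodb document client and variables assumed above; recordOnce is a hypothetical helper name.
// Hypothetical helper: write the result once; on a duplicate key, return the stored one
async function recordOnce(key: string, result: unknown) {
  try {
    await dynamodb.put({
      TableName: 'IdempotencyStore',
      Item: {
        idempotencyKey: key,
        response: result,
        ttl: Math.floor(Date.now() / 1000) + 86400, // 24 h
      },
      ConditionExpression: 'attribute_not_exists(idempotencyKey)',
    });
    return result;
  } catch (err: any) {
    // SDK v3 exposes the exception type on err.name (v2 uses err.code)
    if (err.name === 'ConditionalCheckFailedException') {
      // Duplicate request – hand back whatever was stored the first time
      const existing = await dynamodb.get({
        TableName: 'IdempotencyStore',
        Key: { idempotencyKey: key },
      });
      return existing.Item?.response;
    }
    throw err;
  }
}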
Putting It All Together
Below is a high‑level architecture of a resilient payment‑processing pipeline on AWS:
┌─────────────────────────────────────────────────────────────┐
│ API Gateway (Rate Limiting) │
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────────┐
│ SQS Queue (Bounded Buffer) │
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────────┐
│ Lambda (Reserved Concurrency = 10) – Worker Pool │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Check DynamoDB idempotency store ││
│ │ 2. Process payment with retry + jitter ││
│ │ 3. Store result in DynamoDB ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────────┐
│ DynamoDB Tables │
│ - IdempotencyStore (TTL) │
│ - ProcessingResults │
└─────────────────────────────────────────────────────────────┘
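As a rough sketch (not the actual resilient-relay code), the Lambda in the middle of that diagram could compose the earlier pieces as below; processPayment and persistResult are placeholders, and the function’s reserved concurrency plays the worker-pool role from Pattern 2.
import type { SQSEvent } from 'aws-lambda';

// Helpers from the earlier sketches plus hypothetical payment functions
declare const idempotencyHandler: {
  process<T>(key: string, operation: () => Promise<T>): Promise<T>;
};
declare function withRetry<T>(operation: () => Promise<T>): Promise<T>;
declare function processPayment(payment: any): Promise<{ success: boolean }>;
declare function persistResult(payment: any, result: unknown): Promise<void>;

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const payment = JSON.parse(record.body);

    // 1. Idempotency: a client-supplied key (or the message ID) dedupes redeliveries
    await idempotencyHandler.process(payment.idempotencyKey ?? record.messageId, async () => {
      // 2. Retry with full jitter around the flaky external call
      const result = await withRetry(() => processPayment(payment));
      // 3. Persist the outcome so later lookups don't re-run the charge
      await persistResult(payment, result);
      return result;
    });
  }
};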
What’s Next
The resilient‑relay repository contains the full implementation. Planned enhancements:
- Dead‑letter queue handling for failed payments
- CloudWatch metrics for RED (Rate, Errors, Duration) observability
- Multi‑region failover patterns
When us‑east‑1 goes down again—and it will—your system should degrade gracefully, not catastrophically.
The AWS Builders’ Library exists because Amazon learned these lessons operating AWS itself. The jitter article alone is worth your time.
Call to Action
What reliability patterns have you implemented in your AWS architectures? I’d love to hear what’s worked—or failed spectacularly—in production.