What the AWS us-east-1 Outage Taught Me About Building Resilient Systems

Published: December 14, 2025 at 02:45 PM EST
4 min read
Source: Dev.to

AWS us-east-1 will go down again. When it does, will your system survive?
This past weekend I built a system designed to survive such an outage.

After eight years building subscription infrastructure at Surfline—processing payments through Stripe, Apple, and Google Play—I’ve learned that the real question isn’t if your cloud provider will fail, but how your architecture degrades when it does.

I spent four hours implementing three reliability patterns sourced from the AWS Builders' Library, Google SRE practices, and Stripe's engineering blog. Below are the key takeaways.

Why Resilience Matters

When AWS experiences an incident, common failure modes include:

  • Lambda functions timing out
  • DynamoDB calls failing or throttling
  • SQS queues backing up

For most applications users simply see an error page and retry later. Payment systems are different:

  • A failed charge might actually have succeeded.
  • A retry could double‑charge a customer.
  • A thundering herd of retries can cascade the failure.

Therefore we need patterns that handle partial failures without losing money or trust.

Pattern 1: Retry with Jitter

The AWS Builders’ Library article on Timeouts, retries, and backoff with jitter changed how I think about retry logic. Without jitter, all clients retry at the exact same intervals, creating synchronized waves that hammer a recovering service.

// Full jitter formula from AWS Builders' Library
// (the delay constants are example settings, not prescriptions)
const INITIAL_DELAY = 100;  // base delay in ms
const MAX_DELAY = 20_000;   // cap in ms

const calculateDelay = (attempt: number): number => {
  // Exponential backoff, capped at MAX_DELAY
  const exponentialDelay = Math.min(
    MAX_DELAY,
    INITIAL_DELAY * Math.pow(2, attempt)
  );
  // Full jitter: random value between 0 and the exponential delay
  return Math.random() * exponentialDelay;
};
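
A minimal retry wrapper built on that delay could look like the sketch below; MAX_ATTEMPTS, sleep, and the isRetryable check are illustrative assumptions, not part of the AWS formula.

const MAX_ATTEMPTS = 5; // example cap on total tries

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Placeholder: a real check would look for throttling or 5xx errors
const isRetryable = (_error: unknown): boolean => true;

async function withRetry<T>(operation: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the final attempt or on errors that should not be retried
      if (attempt >= MAX_ATTEMPTS - 1 || !isRetryable(error)) throw error;
      await sleep(calculateDelay(attempt)); // full jitter from above
    }
  }
}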

Result: In load tests the success rate jumped from ~70% to over 99%, because jitter spreads retry load evenly across time instead of creating synchronized spikes.

Where It Helps

  • Lambda retrying DynamoDB during throttling
  • ECS tasks calling external APIs through a NAT gateway
  • Step Functions with retry policies on service integrations
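
For the DynamoDB case you rarely need to hand-roll the backoff: the AWS SDK for JavaScript v3 clients already retry with exponential backoff and jitter, so the main knobs are the retry mode and the attempt budget. A sketch with illustrative values:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// 'standard' and 'adaptive' modes both back off with jitter;
// 'adaptive' adds client-side rate limiting on top
const client = new DynamoDBClient({
  maxAttempts: 5,        // illustrative attempt budget
  retryMode: 'adaptive',
});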

Pattern 2: Bounded Queue + Worker Pool

A bounded queue alone does not limit concurrent processing. In a test I set a queue capacity of 100, sent 200 requests, and expected ~100 rejections—yet got zero because Node.js processed requests faster than they accumulated.

// What you actually need: queue + worker pool
interface Request {
  id: string;
  // ...plus whatever payload the payment needs
}

class BoundedQueue {
  private queue: Request[] = [];
  private readonly capacity = 100;

  enqueue(request: Request): boolean {
    if (this.queue.length >= this.capacity) {
      return false; // reject immediately – caller answers with HTTP 429
    }
    this.queue.push(request);
    return true;
  }

  dequeue(): Request | undefined {
    return this.queue.shift();
  }

  get size(): number {
    return this.queue.length;
  }
}

class WorkerPool {
  private activeWorkers = 0;
  private readonly maxWorkers = 10; // THIS controls throughput

  constructor(private handle: (request: Request) => Promise<void>) {}

  // Pull work until the queue drains or every worker slot is busy
  process(queue: BoundedQueue): void {
    while (this.activeWorkers < this.maxWorkers && queue.size > 0) {
      const request = queue.dequeue();
      if (!request) break;

      this.activeWorkers++;
      this.handle(request)
        .catch((err) => console.error('processing failed', err))
        .finally(() => {
          this.activeWorkers--;
          this.process(queue); // pick up the next item when a slot frees
        });
    }
  }
}
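
Wired together, the enqueue result decides whether to shed load, and the pool drains the queue at a fixed concurrency. A sketch (processPayment and the status codes are illustrative):

// Hypothetical downstream call – replace with the real payment client
async function processPayment(request: Request): Promise<void> {
  // call Stripe / Apple / Google Play here
}

const queue = new BoundedQueue();
const pool = new WorkerPool(processPayment);

function handleIncoming(request: Request): number {
  if (!queue.enqueue(request)) {
    return 429; // buffer full – shed load instead of queueing unbounded work
  }
  pool.process(queue);
  return 202; // accepted for asynchronous processing
}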

Pattern 3: Idempotency Handling

This is where the money is protected: a charge that looks failed may actually have succeeded, so every payment request carries an idempotency key, duplicates replay the cached response, and concurrent duplicates are rejected.

// Thrown when the same key is already being processed by another request
class ConflictError extends Error {}

class IdempotencyHandler {
  private inFlight = new Set<string>();
  private cache = new Map<string, { response: any; ttl: number }>();

  async process(idempotencyKey: string, operation: () => Promise<any>) {
    // Check cache first and honour the TTL
    const cached = this.cache.get(idempotencyKey);
    if (cached && cached.ttl > Date.now()) return cached.response;

    // Detect concurrent duplicates
    if (this.inFlight.has(idempotencyKey)) {
      throw new ConflictError('Request already in progress');
    }

    this.inFlight.add(idempotencyKey);
    try {
      const response = await operation();
      // Only cache successes
      if (response.success) {
        this.cache.set(idempotencyKey, {
          response,
          ttl: Date.now() + 24 * 60 * 60 * 1000,
        });
      }
      return response;
    } finally {
      this.inFlight.delete(idempotencyKey);
    }
  }
}
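
Usage wraps whatever operation must not run twice; the key usually comes from the client. The names below (chargeCustomer, the 'pi_123' key) are hypothetical:

// Hypothetical charge call – stands in for the real Stripe/Apple/Google client
declare function chargeCustomer(id: string, amountCents: number): Promise<{ success: boolean }>;

const idempotency = new IdempotencyHandler();

// Duplicates with the same key replay the cached result instead of re-charging
const response = await idempotency.process('pi_123', () =>
  chargeCustomer('pi_123', 1999)
);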

Storage Options

In Lambda, the handler's in-memory cache disappears between invocations, so the idempotency record needs durable storage:

  • DynamoDB: Conditional writes with TTL for automatic cleanup
  • Lambda Powertools: Built‑in idempotency utility using DynamoDB
  • Step Functions: Native idempotency via execution names

// DynamoDB idempotency pattern (AWS SDK v3 Document client)
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocument } from '@aws-sdk/lib-dynamodb';

const dynamodb = DynamoDBDocument.from(new DynamoDBClient({}));

// Conditional write: throws ConditionalCheckFailedException when the key
// already exists, which is how a duplicate request is detected
await dynamodb.put({
  TableName: 'IdempotencyStore',
  Item: {
    idempotencyKey: key,
    response: result,
    ttl: Math.floor(Date.now() / 1000) + 86400, // expires after 24 h
  },
  ConditionExpression: 'attribute_not_exists(idempotencyKey)'
});

Putting It All Together

Below is a high‑level architecture of a resilient payment‑processing pipeline on AWS:

┌─────────────────────────────────────────────────────────────┐
│                 API Gateway (Rate Limiting)                  │
└─────────────────────┬───────────────────────────────────────┘

┌─────────────────────▼───────────────────────────────────────┐
│                 SQS Queue (Bounded Buffer)                   │
└─────────────────────┬───────────────────────────────────────┘

┌─────────────────────▼───────────────────────────────────────┐
│      Lambda (Reserved Concurrency = 10) – Worker Pool        │
│  ┌─────────────────────────────────────────────────────────┐│
│  │ 1. Check DynamoDB idempotency store                     ││
│  │ 2. Process payment with retry + jitter                  ││
│  │ 3. Store result in DynamoDB                             ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────┬───────────────────────────────────────┘

┌─────────────────────▼───────────────────────────────────────┐
│                       DynamoDB Tables                        │
│  - IdempotencyStore (TTL)                                    │
│  - ProcessingResults                                         │
└─────────────────────────────────────────────────────────────┘
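
If the stack is defined with the AWS CDK, the key knobs map roughly to the sketch below (assumed to live inside a Stack constructor; names and values are illustrative, not taken from the repo):

import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

// Bounded buffer in front of the workers
const paymentQueue = new sqs.Queue(this, 'PaymentQueue', {
  visibilityTimeout: Duration.seconds(60),
});

// Idempotency store with automatic TTL cleanup
const idempotencyStore = new dynamodb.Table(this, 'IdempotencyStore', {
  partitionKey: { name: 'idempotencyKey', type: dynamodb.AttributeType.STRING },
  timeToLiveAttribute: 'ttl',
});

// Worker pool: reserved concurrency caps parallel payment processing at 10
const worker = new lambda.Function(this, 'PaymentWorker', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('dist'),
  reservedConcurrentExecutions: 10,
});
worker.addEventSource(new SqsEventSource(paymentQueue, { batchSize: 1 }));
idempotencyStore.grantReadWriteData(worker);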

What’s Next

The resilient‑relay repository contains the full implementation. Planned enhancements:

  • Dead‑letter queue handling for failed payments
  • CloudWatch metrics for RED (Rate, Errors, Duration) observability
  • Multi‑region failover patterns

When us‑east‑1 goes down again—and it will—your system should degrade gracefully, not catastrophically.

The AWS Builders’ Library exists because Amazon learned these lessons operating AWS itself. The jitter article alone is worth your time.

Call to Action

What reliability patterns have you implemented in your AWS architectures? I’d love to hear what’s worked—or failed spectacularly—in production.
