What the AWS us-east-1 Outage Taught Me About Building Resilient Systems
AWS us‑east-1 will go down again. When it does, will your system survive?
This past weekend I built a system designed to survive such an outage.
After eight years building subscription infrastructure at Surfline—processing payments through Stripe, Apple, and Google Play—I’ve learned that the real question isn’t if your cloud provider will fail, but how your architecture degrades when it does.
I spent four hours implementing three reliability patterns drawn from the AWS Builders’ Library, Google SRE practices, and Stripe’s engineering blog. Below are the key takeaways.
Why Resilience Matters
When AWS experiences an incident, common failure modes include:
- Lambda functions timing out
- DynamoDB calls failing or throttling
- SQS queues backing up
For most applications, users simply see an error page and retry later. Payment systems are different:
- A failed charge might actually have succeeded.
- A retry could double‑charge a customer.
- A thundering herd of retries can turn a partial outage into a cascading failure.
Therefore we need patterns that handle partial failures without losing money or trust.
Pattern 1: Retry with Jitter
The AWS Builders’ Library article on Timeouts, retries, and backoff with jitter changed how I think about retry logic. Without jitter, all clients retry at the exact same intervals, creating synchronized waves that hammer a recovering service.
// Full jitter formula from AWS Builders' Library
const INITIAL_DELAY = 100;  // ms – example base delay, tune for your workload
const MAX_DELAY = 20_000;   // ms – example cap so retries never wait forever

const calculateDelay = (attempt: number): number => {
  const exponentialDelay = Math.min(
    MAX_DELAY,
    INITIAL_DELAY * Math.pow(2, attempt)
  );
  // Full jitter: random value between 0 and the exponential delay
  return Math.random() * exponentialDelay;
};
Result: In load tests the success rate jumped from roughly 70% to over 99%, because jitter spreads retry load evenly across time instead of producing synchronized spikes.
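To show where that delay function plugs in, here’s a minimal retry wrapper sketch; sendPayment and the MAX_RETRIES cap are illustrative placeholders, not part of the original implementation.
// Hypothetical wrapper around calculateDelay – retries an operation with full jitter
const MAX_RETRIES = 5; // example cap; tune for your workload
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(operation: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Sleep a random, exponentially growing amount of time before retrying
      await sleep(calculateDelay(attempt));
    }
  }
  throw lastError;
}

// Usage: await withRetry(() => sendPayment(request));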
Where It Helps
- Lambda retrying DynamoDB during throttling
- ECS tasks calling external APIs through a NAT gateway
- Step Functions with retry policies on service integrations
Pattern 2: Bounded Queue + Worker Pool
A bounded queue alone does not limit concurrent processing. In a test I set a queue capacity of 100, sent 200 requests, and expected ~100 rejections—yet got zero because Node.js processed requests faster than they accumulated.
// What you actually need: queue + worker pool
interface Request { id: string } // minimal request shape for the sketch

class BoundedQueue {
  private queue: Request[] = [];
  private readonly capacity = 100;

  enqueue(request: Request): boolean {
    if (this.queue.length >= this.capacity) {
      return false; // HTTP 429 – fail fast
    }
    this.queue.push(request);
    return true;
  }

  dequeue(): Request | undefined {
    return this.queue.shift();
  }
}

class WorkerPool {
  private activeWorkers = 0;
  private readonly maxWorkers = 10; // THIS controls throughput

  constructor(private readonly handle: (r: Request) => Promise<void>) {}

  async process(queue: BoundedQueue): Promise<void> {
    // Start handlers until the concurrency cap is hit or the queue drains
    while (this.activeWorkers < this.maxWorkers) {
      const request = queue.dequeue();
      if (!request) break;
      this.activeWorkers++;
      this.handle(request).finally(() => {
        this.activeWorkers--;
        void this.process(queue); // pick up more work when a slot frees
      });
    }
  }
}
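A quick usage sketch, with handlePayment standing in for the real payment call (it isn’t part of the original code): requests that don’t fit in the queue are rejected with a 429 right away, while the pool caps how many are processed at once.
// Hypothetical wiring of the two pieces – handlePayment is a placeholder
const handlePayment = async (request: Request): Promise<void> => {
  // call Stripe / Apple / Google Play here
};

const queue = new BoundedQueue();
const pool = new WorkerPool(handlePayment);

function handleIncoming(request: Request): number {
  if (!queue.enqueue(request)) {
    return 429; // shed load instead of buffering without bound
  }
  void pool.process(queue); // drain with at most maxWorkers in flight
  return 202; // accepted for asynchronous processing
}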
Pattern 3: Idempotency Handling
A payment request has to be safe to submit twice, because retries are exactly what the previous two patterns produce. Here is an in-memory handler that illustrates the idea:
class ConflictError extends Error {}

class IdempotencyHandler {
  private inFlight = new Set<string>();
  private cache = new Map<string, { response: any; ttl: number }>();

  async process(idempotencyKey: string, operation: () => Promise<any>) {
    // Return the cached response if the key was already processed and hasn't expired
    const cached = this.cache.get(idempotencyKey);
    if (cached && cached.ttl > Date.now()) return cached.response;

    // Detect concurrent duplicates
    if (this.inFlight.has(idempotencyKey)) {
      throw new ConflictError('Request already in progress');
    }

    this.inFlight.add(idempotencyKey);
    try {
      const response = await operation();
      // Only cache successes
      if (response.success) {
        this.cache.set(idempotencyKey, {
          response,
          ttl: Date.now() + 24 * 60 * 60 * 1000, // keep for 24 h
        });
      }
      return response;
    } finally {
      this.inFlight.delete(idempotencyKey);
    }
  }
}
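A usage sketch; chargeCustomer and the key format are illustrative rather than from the original code. The point is that a client retrying the same order reuses the same key, so it gets the first result back instead of a second charge.
// Hypothetical caller – chargeCustomer stands in for the real provider call
declare function chargeCustomer(
  orderId: string,
  amountCents: number
): Promise<{ success: boolean }>;

const idempotencyHandler = new IdempotencyHandler();

async function charge(orderId: string, amountCents: number) {
  // Every retry of the same order derives the same key
  return idempotencyHandler.process(`charge-${orderId}`, () =>
    chargeCustomer(orderId, amountCents)
  );
}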
Storage Options
- DynamoDB: Conditional writes with TTL for automatic cleanup
- Lambda Powertools: Built‑in idempotency utility using DynamoDB
- Step Functions: Native idempotency via execution names
// DynamoDB idempotency pattern
await dynamodb.put({
  TableName: 'IdempotencyStore',
  Item: {
    idempotencyKey: key,
    response: result,
    ttl: Math.floor(Date.now() / 1000) + 86400, // 24 h
  },
  ConditionExpression: 'attribute_not_exists(idempotencyKey)',
});
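One detail the snippet above doesn’t show: when the condition fails, DynamoDB rejects the write with a ConditionalCheckFailedException, which is exactly the duplicate you want to catch. Here is a sketch that reuses the dynamodb document client and variables assumed above; recordOnce is a hypothetical helper name.
// Hypothetical helper: write the result once; on a duplicate key, return the stored one
async function recordOnce(key: string, result: unknown) {
  try {
    await dynamodb.put({
      TableName: 'IdempotencyStore',
      Item: {
        idempotencyKey: key,
        response: result,
        ttl: Math.floor(Date.now() / 1000) + 86400, // 24 h
      },
      ConditionExpression: 'attribute_not_exists(idempotencyKey)',
    });
    return result;
  } catch (err: any) {
    // SDK v3 exposes the exception type on err.name (v2 uses err.code)
    if (err.name === 'ConditionalCheckFailedException') {
      // Duplicate request – hand back whatever was stored the first time
      const existing = await dynamodb.get({
        TableName: 'IdempotencyStore',
        Key: { idempotencyKey: key },
      });
      return existing.Item?.response;
    }
    throw err;
  }
}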
Putting It All Together
Below is a high‑level architecture of a resilient payment‑processing pipeline on AWS:
┌─────────────────────────────────────────────────────────────┐
│ API Gateway (Rate Limiting) │
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────────┐
│ SQS Queue (Bounded Buffer) │
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────────┐
│ Lambda (Reserved Concurrency = 10) – Worker Pool │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Check DynamoDB idempotency store ││
│ │ 2. Process payment with retry + jitter ││
│ │ 3. Store result in DynamoDB ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────────┐
│ DynamoDB Tables │
│ - IdempotencyStore (TTL) │
│ - ProcessingResults │
└─────────────────────────────────────────────────────────────┘
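As a rough sketch (not the actual resilient-relay code), the Lambda in the middle of that diagram could compose the earlier pieces as below; processPayment and persistResult are placeholders, and the function’s reserved concurrency plays the worker-pool role from Pattern 2.
import type { SQSEvent } from 'aws-lambda';

// Helpers from the earlier sketches plus hypothetical payment functions
declare const idempotencyHandler: {
  process<T>(key: string, operation: () => Promise<T>): Promise<T>;
};
declare function withRetry<T>(operation: () => Promise<T>): Promise<T>;
declare function processPayment(payment: any): Promise<{ success: boolean }>;
declare function persistResult(payment: any, result: unknown): Promise<void>;

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const payment = JSON.parse(record.body);

    // 1. Idempotency: a client-supplied key (or the message ID) dedupes redeliveries
    await idempotencyHandler.process(payment.idempotencyKey ?? record.messageId, async () => {
      // 2. Retry with full jitter around the flaky external call
      const result = await withRetry(() => processPayment(payment));
      // 3. Persist the outcome so later lookups don't re-run the charge
      await persistResult(payment, result);
      return result;
    });
  }
};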
What’s Next
The resilient‑relay repository contains the full implementation. Planned enhancements:
- Dead‑letter queue handling for failed payments
- CloudWatch metrics for RED (Rate, Errors, Duration) observability
- Multi‑region failover patterns
When us‑east‑1 goes down again—and it will—your system should degrade gracefully, not catastrophically.
The AWS Builders’ Library exists because Amazon learned these lessons operating AWS itself. The jitter article alone is worth your time.
Call to Action
What reliability patterns have you implemented in your AWS architectures? I’d love to hear what’s worked—or failed spectacularly—in production.