10 AWS Production Incidents That Taught Me Real-World SRE
Source: Dev.to
1️⃣ 3 AM Wake‑up Call – 403 Errors
Symptom
- CloudWatch alarm: elevated 4XX errors.
- Traffic looked normal, but ≈ 30 % of requests returned 403.
What I thought
- API Gateway throttling or IAM permission issues.
What it actually was
- A code deployment changed JWT validation logic.
- Tokens from older mobile‑app versions (still used by ~30 % of users) were being rejected.
Fix
# Roll back the problematic deployment (the CLI has no "rollback" subcommand;
# stop the in-flight deployment with automatic rollback to redeploy the previous revision)
aws deploy list-deployments \
  --application-name my-app \
  --deployment-group-name prod \
  --include-only-statuses InProgress
aws deploy stop-deployment \
  --deployment-id d-XXXXXXXXX \
  --auto-rollback-enabled
# Add backward‑compatible token validation
# (code change – omitted for brevity)
Fast action
- Rolled back the deployment.
- Added compatibility for older tokens.
- Set up a CloudWatch metric to monitor app version distribution.
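For the version-distribution metric, a custom CloudWatch metric keyed on the client version is enough; here is a rough boto3 sketch (the namespace, metric, and dimension names are made up for illustration):
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_app_version(version: str):
    # Emit one data point per (sampled) request tagged with the client app version.
    # Graphing SampleCount by AppVersion shows how many users are still on old builds.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Clients",                 # hypothetical namespace
        MetricData=[{
            "MetricName": "RequestsByAppVersion",
            "Dimensions": [{"Name": "AppVersion", "Value": version}],
            "Value": 1,
            "Unit": "Count",
        }],
    )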
2️⃣ 5XX Spikes During Peak Traffic
Symptom
- 5XX errors spiking; load‑balancer health checks passed.
- ≈ 15 % of requests failed.
What I thought
- Backend service was overwhelmed.
What it actually was
- Lambda functions timed out due to cold starts during the traffic surge, returning 504 Gateway Timeout via API Gateway.
Fix
# Enable provisioned concurrency for the hot functions
# (must target a published version or alias, e.g. a "prod" alias – $LATEST is not supported)
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier prod \
  --provisioned-concurrent-executions 100
- Implemented exponential back‑off in API Gateway integrations.
Fast action
- Enabled provisioned concurrency for traffic‑sensitive Lambdas.
- Added CloudWatch alarms for concurrent executions approaching the account limit (example below).
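The concurrency alarm can be created like this; the threshold and SNS topic are placeholders (assuming the default 1,000 account concurrency limit):
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when account-level concurrent executions approach the limit
# (a threshold of 800 assumes the default 1,000-concurrency account limit).
cloudwatch.put_metric_alarm(
    AlarmName="lambda-concurrency-near-limit",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=800,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder topic
)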
3️⃣ Route 53 Failover to an Unready Secondary Region
Symptom
- At 2 PM, Route 53 failover routed all traffic to the secondary region, which quickly became overloaded.
What I thought
- Primary region was experiencing a failure.
What it actually was
- A security‑group change blocked the Route 53 health‑check endpoint.
- Service was healthy, but Route 53 could not verify it.
Fix
# Allow Route 53 health‑checker IP ranges
curl https://ip-ranges.amazonaws.com/ip-ranges.json |
  jq -r '.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix' |
  while read cidr; do
    aws ec2 authorize-security-group-ingress \
      --group-id sg-xxxxxx \
      --protocol tcp \
      --port 443 \
      --cidr "$cidr"
  done
# Quick health‑check test
curl -v https://api.example.com/health
Fast action
- Added the Route 53 health‑checker IP ranges to the security group.
- Implemented internal health checks that validate both endpoint accessibility and actual service health.
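For the internal checks, the idea is roughly this: endpoint reachability is what Route 53 already probes, so the internal check validates the dependencies the service actually needs (the table name and threshold here are illustrative):
import shutil
import boto3

def deep_health_check():
    # Goes beyond "is the port open": exercise real dependencies and return a
    # structured result for internal dashboards and alarms.
    checks = {}
    try:
        boto3.client("dynamodb").describe_table(TableName="sessions")  # hypothetical table
        checks["dynamodb"] = "ok"
    except Exception as exc:
        checks["dynamodb"] = f"fail: {exc}"
    checks["disk"] = "ok" if shutil.disk_usage("/").free > 1_000_000_000 else "low"
    return all(v == "ok" for v in checks.values()), checks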
4️⃣ “Connection Pool Exhausted” Errors – RDS
Symptom
- Application logs: “connection pool exhausted”.
- RDS metrics: CPU ≈ 20 %, connections well below max_connections.
What I thought
- Need to increase max_connections on the DB.
What it actually was
- The app wasn’t releasing connections after exceptions, leaving zombie connections in the pool.
Fix
# Example context manager to ensure proper cleanup
from contextlib import contextmanager

@contextmanager
def db_cursor(conn):
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()       # commit only on success
    except Exception:
        conn.rollback()     # roll back instead of committing on errors
        raise
    finally:
        cur.close()         # always release the cursor, even after exceptions
- Added connection‑timeout settings, circuit‑breaker logic, and a CloudWatch dashboard tracking pool health.
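If you happen to be on SQLAlchemy, the pool-level timeout settings look roughly like this (the DSN and numbers are illustrative; the point is to fail fast and recycle stale connections):
from sqlalchemy import create_engine

# Hypothetical DSN – the important part is the pool configuration.
engine = create_engine(
    "postgresql+psycopg2://app:secret@prod-db.example.internal:5432/app",
    pool_size=20,          # steady-state connections
    max_overflow=10,       # short bursts above pool_size
    pool_timeout=5,        # fail fast instead of queueing forever
    pool_recycle=1800,     # drop connections before server-side timeouts hit
    pool_pre_ping=True,    # detect and replace dead (zombie) connections
)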
Fast action
- Implemented the context manager above.
- Set alarms at 70 % pool utilization (instead of 95 %).
5️⃣ Lambda “Rate Exceeded” Errors During a Batch Job
Symptom
- Lambda functions failed with Rate exceeded while processing a batch job.
- The job halted completely.
What I thought
- We hit an AWS service limit.
What it actually was
- The batch job performed 10 000 concurrent DynamoDB writes with no back‑off, exhausting the table’s write capacity within seconds.
Fix
import time, random
from botocore.config import Config

def exponential_backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise                                     # out of retries – surface the error
            backoff = (2 ** attempt) + random.random()    # exponential delay plus jitter
            time.sleep(backoff)

# Use the built‑in retry config for the AWS SDK (boto3)
config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)
Fast action
- Wrapped DynamoDB writes with the retry helper.
- Enabled DynamoDB auto‑scaling for write capacity.
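Write-capacity auto-scaling goes through Application Auto Scaling; a boto3 sketch with a placeholder table name and limits:
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target (placeholder limits).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/batch-output",              # hypothetical table name
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=25,
    MaxCapacity=1000,
)

# Target-track 70 % write-capacity utilization.
autoscaling.put_scaling_policy(
    PolicyName="batch-output-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/batch-output",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)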
6️⃣ ALB Marking Healthy Instances Unhealthy
Symptom
- ALB sporadically marked instances unhealthy → 502 errors for some requests.
What I thought
- Instances were genuinely failing under load.
What it actually was
- Health‑check interval was 5 s with a 2 s timeout.
- Brief CPU spikes prevented the health‑check response, causing false‑negative health reports.
Fix
# Adjust target‑group health‑check settings
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 3
- Made the health‑check endpoint lightweight (no DB queries).
Best practice
| ✅ Do | ❌ Don’t |
|---|---|
| Health check that only verifies the process is alive (e.g., /ping). | Health check that performs expensive operations (e.g., a full DB query). |
Fast action
- Updated the health‑check configuration.
- Deployed a lightweight /ping endpoint.
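The /ping endpoint itself is deliberately boring – no database, no downstream calls. A sketch assuming a Flask app (adapt to whatever framework you run):
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/ping")
def ping():
    # Liveness only: the process is up and can serve a response.
    # Deep dependency checks live on a separate, internal-only endpoint.
    return jsonify(status="ok"), 200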
7️⃣ P99 Latency Spike to 8 s During Low Traffic
Symptom
- P99 latency surged to 8 s while P50 stayed at 200 ms during quiet periods.
What I thought
- Backend database performance degradation.
What it actually was
- Lambda cold starts. Functions were being terminated during idle periods, leading to long start‑up times for the next request.
Fix
# Enable provisioned concurrency for latency‑sensitive functions
# (must target a published version or alias, e.g. a "prod" alias – $LATEST is not supported)
aws lambda put-provisioned-concurrency-config \
  --function-name api-handler \
  --qualifier prod \
  --provisioned-concurrent-executions 200
# Keep functions warm with an EventBridge schedule (rate expression: every 5 minutes)
aws events put-rule \
  --name WarmLambdaRule \
  --schedule-expression "rate(5 minutes)"
- Reduced deployment package size by ≈60 % (removed unused libraries).
Fast action
- Applied provisioned concurrency to user‑facing APIs.
- Scheduled periodic “ping” invocations.
- Optimized the package size.
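One note on the warm-up schedule: put-rule only creates the schedule, so the pings start once the function is attached as a target and EventBridge is allowed to invoke it. A boto3 sketch with placeholder ARNs:
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:api-handler"  # placeholder
rule_arn = events.describe_rule(Name="WarmLambdaRule")["Arn"]

# Point the schedule at the function...
events.put_targets(
    Rule="WarmLambdaRule",
    Targets=[{"Id": "warm-api-handler", "Arn": function_arn}],
)

# ...and let EventBridge invoke it.
lambda_client.add_permission(
    FunctionName="api-handler",
    StatementId="allow-warmlambdarule",   # placeholder statement ID
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)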
8️⃣ DynamoDB ProvisionedThroughputExceededException During Report Generation
Symptom
- Writes succeeded, but reads failed with ProvisionedThroughputExceededException during daily report generation.
What I thought
- Need to increase read capacity units.
What it actually was
- The report used a Scan operation without pagination, creating a hot partition that consumed all read capacity in seconds.
Fix
def paginated_query(table, key_condition):
    items = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = table.query(
                KeyConditionExpression=key_condition,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = table.query(
                KeyConditionExpression=key_condition
            )
        items.extend(response["Items"])
        last_evaluated_key = response.get("LastEvaluatedKey")
        if not last_evaluated_key:
            break
    return items
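Called with a real key condition instead of a Scan (hypothetical user_id / created_at schema), usage looks like this:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reports")   # hypothetical table

daily_items = paginated_query(
    table,
    Key("user_id").eq("u-123") & Key("created_at").begins_with("2024-05-01"),
)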
- Switched to Query wherever possible.
- Implemented pagination and exponential back‑off.
- Enabled DynamoDB auto‑scaling and added composite sort keys for efficient access patterns.
Fast action
- Refactored the report job to use the paginated query above.
- Turned on auto‑scaling for the table.
9️⃣ Aggressive Health‑Check Causing False Positives
- Overly aggressive health‑check intervals can cause healthy instances to be marked unhealthy.
- Balance: enough frequency to catch real failures, but not so tight that transient spikes trigger false alarms. With the incident‑6 settings, for example, a 30 s interval and an unhealthy threshold of 3 still detect a genuinely dead instance within about 90 seconds, while a brief CPU spike no longer flips its status.
🔟 General Lessons & Takeaways
| Incident | Core Lesson |
|---|---|
| 1 – JWT validation | Version compatibility matters; always support older clients during rollouts. |
| 2 – Lambda cold starts | Provisioned concurrency + warm‑up schedules mitigate latency spikes. |
| 3 – Route 53 health checks | Security‑group rules must allow health‑checker IP ranges. |
| 4 – DB connection pools | Enforce proper cleanup; monitor pool usage well before saturation. |
| 5 – DynamoDB rate limits | Build exponential back‑off & retries from day 1. |
| 6 – ALB health checks | Keep health‑check endpoints lightweight; separate deep checks from routing checks. |
| 7 – P99 latency | Cold starts dominate tail latency; provisioned concurrency is the antidote. |
| 8 – DynamoDB hot partitions | Prefer Query over Scan, paginate, and design for even key distribution. |
| 9 – Aggressive health checks | Too‑tight intervals cause false alarms; tune thresholds and timeouts. |
| 10 – (Overall) | Observability first – instrument, alarm, and test assumptions before they become incidents. |
By treating each incident as a learning opportunity and applying disciplined, observable fixes, you can turn chaotic production fire‑fighting into a predictable, resilient operation.
Incident 1 – 502 Errors During Blue‑Green Deployments
Symptom
- 5 % of requests failed on every deployment with 502 Bad Gateway errors, despite using a blue‑green deployment strategy.
Initial Hypothesis
- Instances were shutting down too quickly.
Root Cause
- The connection‑draining timeout (deregistration delay) on the ALB was set to 30 seconds, while some API calls took up to 60 seconds.
- The ALB terminated those connections mid‑request, resulting in the 502 errors.
Fix
# Increase the connection‑draining timeout (deregistration delay)
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/name/xxxxxxxxxxxx \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120
- Added a deployment health‑check that verifies no in‑flight requests are being dropped.
Fast Action Taken
- Increased the deregistration delay.
- Implemented a graceful‑shutdown routine in the application (stop accepting new requests, finish existing ones) – sketched below.
- Added a pre‑deployment validation step.
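A minimal sketch of that graceful-shutdown routine, kept self-contained with the stdlib HTTP server (a real service would hook into its framework's lifecycle; the 120 s wait matches the new deregistration delay):
import signal
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

draining = False

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The health check flips to 503 while draining so the ALB takes us out of rotation.
        if self.path == "/health":
            self.send_response(503 if draining else 200)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = HTTPServer(("0.0.0.0", 8080), Handler)

def handle_sigterm(signum, frame):
    global draining
    draining = True                                   # stop advertising ourselves as healthy
    threading.Timer(120, server.shutdown).start()     # ≥ deregistration delay: let in-flight requests finish

signal.signal(signal.SIGTERM, handle_sigterm)
server.serve_forever()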
Lesson Learned
- The connection‑draining timeout must be longer than the longest request latency.
- Regularly monitor P99 latency and set the timeout accordingly.
Incident 2 – Deployment Script Left Security Groups Inconsistent
Symptom
- Deployment script failed halfway, leaving security groups in an inconsistent state.
- Could not SSH to instances and could not roll back the deployment.
What I thought
- Manually fix the security groups.
What it actually was
- The automation script had no rollback mechanism and changed production security groups without testing.
Fix
# Open a Session Manager session to the affected instance
aws ssm start-session --target i-1234567890abcdef0
- Describe the current security groups (for reference).
- Make changes atomically – e.g., add the required rule:
aws ec2 authorize-security-group-ingress \
  --group-id sg-12345 \
  --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges='[{CidrIp=0.0.0.0/0}]'
- Validate that the change worked.
- Remove the old rule only after successful validation.
Better Approach
- Manage security groups with AWS CloudFormation (or other IaC tools) to ensure atomic, version‑controlled updates.
Fast Action Taken
- Enabled Systems Manager Session Manager on all instances.
- Switched security‑group management to CloudFormation.
- Implemented a change‑approval workflow.
Lesson Learned
- Never modify security groups manually in production; a single mistake can lock you out.
- Use Infrastructure‑as‑Code and Session Manager as safety nets.
Tools That Make This Easier
When incidents happen, speed matters. I built an Incident Helper script that automates the repetitive parts of incident response:
- Collects relevant CloudWatch logs.
- Checks service health status.
- Identifies common AWS misconfigurations.
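The script itself isn't included here, but a stripped-down sketch of the log-collection piece gives the idea (log group, time window, and filter pattern are placeholders):
import time
import boto3

logs = boto3.client("logs")

def collect_recent_errors(log_group="/aws/lambda/api-handler", minutes=30):
    # Pull ERROR-level events from the last N minutes so they're in front of the
    # responder before anyone starts clicking around the console.
    now_ms = int(time.time() * 1000)
    events = logs.filter_log_events(
        logGroupName=log_group,
        startTime=now_ms - minutes * 60 * 1000,
        endTime=now_ms,
        filterPattern="ERROR",
    )
    return [e["message"] for e in events.get("events", [])]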
The Real Lesson
- Document every incident.
- Build and maintain runbooks.
- Regularly test fail‑over procedures.
- Hold weekly post‑mortem discussions with the team.
The next incident is already scheduled; you just don’t know when.
Being prepared makes all the difference.