10 AWS Production Incidents That Taught Me Real-World SRE

Published: January 8, 2026 at 11:25 AM EST
7 min read
Source: Dev.to

1️⃣ 3 AM Wake‑up Call – 403 Errors

Symptom

  • CloudWatch alarm: elevated 4XX errors.
  • Traffic looked normal, but ≈ 30 % of requests returned 403.

What I thought

  • API Gateway throttling or IAM permission issues.

What it actually was

  • A code deployment changed JWT validation logic.
  • Tokens from older mobile‑app versions (still used by ~30 % of users) were being rejected.

Fix

# Roll back the problematic deployment. The CLI has no "aws deploy rollback"
# subcommand; stopping the in-flight CodeDeploy deployment with auto-rollback
# enabled reverts to the last good revision (deployment ID is a placeholder).
aws deploy stop-deployment \
    --deployment-id <deployment-id> \
    --auto-rollback-enabled

# Add backward‑compatible token validation
# (code change – omitted for brevity)

Fast action

  • Rolled back the deployment.
  • Added compatibility for older tokens.
  • Set up a CloudWatch metric to monitor app version distribution.
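
A sketch of that version-distribution metric with boto3 (namespace, metric, and dimension names are illustrative): emit one count per request tagged with the client's app version, so a dashboard shows how many users are still on old builds before a breaking change ships.

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_app_version(app_version):
    # One data point per request, dimensioned by the client app version
    cloudwatch.put_metric_data(
        Namespace="MyApp/Clients",
        MetricData=[{
            "MetricName": "RequestsByAppVersion",
            "Dimensions": [{"Name": "AppVersion", "Value": app_version}],
            "Value": 1,
            "Unit": "Count",
        }],
    )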

2️⃣ 5XX Spikes During Peak Traffic

Symptom

  • 5XX errors spiking; load‑balancer health checks passed.
  • ≈ 15 % of requests failed.

What I thought

  • Backend service was overwhelmed.

What it actually was

  • Lambda functions timed out due to cold starts during the traffic surge, returning 504 Gateway Timeout via API Gateway.

Fix

# Enable provisioned concurrency for the hot functions
# (provisioned concurrency requires a published version or alias,
#  not $LATEST; "live" here is an example alias)
aws lambda put-provisioned-concurrency-config \
    --function-name my-function \
    --qualifier live \
    --provisioned-concurrent-executions 100

  • Implemented exponential back‑off in the clients behind the API Gateway integrations.

Fast action

  • Enabled provisioned concurrency for traffic‑sensitive Lambdas.
  • Added CloudWatch alarms for concurrent‑execution approaching limits.
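
A sketch of such an alarm with boto3 (the 800 threshold assumes the common default regional concurrency limit of 1,000, and the SNS topic ARN is a placeholder):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="lambda-concurrency-near-limit",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",   # account-level concurrency metric
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=800,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder topic
)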

3️⃣ Route 53 Failover to an Unready Secondary Region

Symptom

  • At 2 PM, Route 53 failover routed all traffic to the secondary region, which quickly became overloaded.

What I thought

  • Primary region was experiencing a failure.

What it actually was

  • A security‑group change blocked the Route 53 health‑check endpoint.
  • Service was healthy, but Route 53 could not verify it.

Fix

# Allow the Route 53 health‑checker IP ranges
# (the health checkers are published under the ROUTE53_HEALTHCHECKS service)
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json |
jq -r '.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix' |
while read cidr; do
    aws ec2 authorize-security-group-ingress \
        --group-id sg-xxxxxx \
        --protocol tcp \
        --port 443 \
        --cidr "$cidr"
done
# Quick health‑check test
curl -v https://api.example.com/health

Fast action

  • Added the Route 53 health‑checker IP ranges to the security group.
  • Implemented internal health checks that validate both endpoint accessibility and actual service health.
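
A minimal, framework-agnostic sketch of such an internal check (the check names and callables are hypothetical): each dependency is probed for real, and the endpoint returns 503 if any probe fails, so Route 53 sees actual service health rather than just an open port.

def deep_health_check(checks):
    """checks: dict mapping a name to a callable that raises on failure."""
    results = {}
    for name, check in checks.items():
        try:
            check()                      # e.g. run "SELECT 1", ping the cache, ...
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    healthy = all(v == "ok" for v in results.values())
    return (200 if healthy else 503), results

# Example wiring (hypothetical callables):
# status, detail = deep_health_check({"database": ping_db, "cache": ping_redis})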

4️⃣ “Connection Pool Exhausted” Errors – RDS

Symptom

  • Application logs: “connection pool exhausted”.
  • RDS metrics: CPU ≈ 20 %, connections well below max_connections.

What I thought

  • Need to increase max_connections on the DB.

What it actually was

  • The app wasn’t releasing connections after exceptions, leaving zombie connections in the pool.

Fix

# Example context manager to ensure connections are always cleaned up
from contextlib import contextmanager

@contextmanager
def db_cursor(conn):
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()          # commit only if the block succeeded
    except Exception:
        conn.rollback()        # never commit half-finished work
        raise
    finally:
        cur.close()            # always release the cursor

  • Added connection‑timeout settings, circuit‑breaker logic, and a CloudWatch dashboard tracking pool health.
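
Illustrative usage of the context manager above (table and column names are hypothetical); the cursor is always closed and the transaction is finalized even if execute() raises:

with db_cursor(conn) as cur:
    cur.execute("UPDATE orders SET status = %s WHERE id = %s", ("shipped", 42))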

Fast action

  • Implemented the context manager above.
  • Set alarms at 70 % pool utilization (instead of 95 %).

5️⃣ Lambda “Rate Exceeded” Errors During a Batch Job

Symptom

  • Lambda functions failed with Rate exceeded while processing a batch job.
  • The job halted completely.

What I thought

  • We hit an AWS service limit.

What it actually was

  • The batch job performed 10 000 concurrent DynamoDB writes with no back‑off, exhausting the table’s write capacity within seconds.

Fix

import time, random
from botocore.config import Config

def exponential_backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise                      # out of retries: surface the error
            # exponential back-off with jitter
            time.sleep((2 ** attempt) + random.random())

# Or use the built‑in retry config for the AWS SDK
config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)
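
Illustrative usage of the helper and retry config above (table name and item are hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb", config=config)   # SDK-level retries
table = dynamodb.Table("batch-results")

# Application-level back-off around the write itself
exponential_backoff_retry(
    lambda: table.put_item(Item={"pk": "job-123", "status": "done"})
)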

Fast action

  • Wrapped DynamoDB writes with the retry helper.
  • Enabled DynamoDB auto‑scaling for write capacity.

6️⃣ ALB Marking Healthy Instances Unhealthy

Symptom

  • ALB sporadically marked instances unhealthy → 502 errors for some requests.

What I thought

  • Instances were genuinely failing under load.

What it actually was

  • Health‑check interval was 5 s with a 2 s timeout.
  • Brief CPU spikes meant the health‑check response missed the 2 s timeout, so healthy instances were falsely marked unhealthy.

Fix

# Adjust target‑group health‑check settings
aws elbv2 modify-target-group \
    --target-group-arn arn:aws:elasticloadbalancing:... \
    --health-check-interval-seconds 30 \
    --health-check-timeout-seconds 5 \
    --healthy-threshold-count 3 \
    --unhealthy-threshold-count 3

  • Made the health‑check endpoint lightweight (no DB queries).

Best practice

  ✅ Do: a health check that only verifies the process is alive (e.g., /ping).
  ❌ Don’t: a health check that performs expensive operations (e.g., a full DB query).

Fast action

  • Updated the health‑check configuration.
  • Deployed a lightweight /ping endpoint.
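
For illustration, the /ping endpoint can be as small as this standard-library sketch (the port is arbitrary); it only confirms the process is alive and never touches the database:

from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Liveness only: no DB queries, no downstream calls
        if self.path == "/ping":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"pong")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PingHandler).serve_forever()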

7️⃣ P99 Latency Spike to 8 s During Low Traffic

Symptom

  • P99 latency surged to 8 s while P50 stayed at 200 ms during quiet periods.

What I thought

  • Backend database performance degradation.

What it actually was

  • Lambda cold starts. Functions were being terminated during idle periods, leading to long start‑up times for the next request.

Fix

# Enable provisioned concurrency for latency‑sensitive functions
# (again, on a published version or alias, not $LATEST; "live" is an example alias)
aws lambda put-provisioned-concurrency-config \
    --function-name api-handler \
    --qualifier live \
    --provisioned-concurrent-executions 200

# Keep functions warm with an EventBridge schedule (rate expression: every 5 minutes)
aws events put-rule \
    --name WarmLambdaRule \
    --schedule-expression "rate(5 minutes)"

  • Reduced deployment package size by ≈60 % (removed unused libraries).
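
The put-rule call above only creates the schedule; the rule still needs a target and an invoke permission before it warms anything. A minimal boto3 sketch, assuming the api-handler function above and placeholder account/region values:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:api-handler"

# Point the schedule at the function...
events.put_targets(
    Rule="WarmLambdaRule",
    Targets=[{"Id": "warm-api-handler", "Arn": function_arn}],
)

# ...and allow EventBridge to invoke it.
lambda_client.add_permission(
    FunctionName="api-handler",
    StatementId="allow-warm-lambda-rule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/WarmLambdaRule",
)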

Fast action

  • Applied provisioned concurrency to user‑facing APIs.
  • Scheduled periodic “ping” invocations.
  • Optimized the package size.

8️⃣ DynamoDB ProvisionedThroughputExceededException During Report Generation

Symptom

  • Writes succeeded, but reads failed with ProvisionedThroughputExceededException during daily report generation.

What I thought

  • Need to increase read capacity units.

What it actually was

  • The report used a Scan operation without pagination, creating a hot partition that consumed all read capacity in seconds.

Fix

def paginated_query(table, key_condition):
    items = []
    last_evaluated_key = None

    while True:
        if last_evaluated_key:
            response = table.query(
                KeyConditionExpression=key_condition,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = table.query(
                KeyConditionExpression=key_condition
            )

        items.extend(response["Items"])
        last_evaluated_key = response.get("LastEvaluatedKey")
        if not last_evaluated_key:
            break

    return items

  • Switched to Query wherever possible.
  • Implemented pagination and exponential back‑off.
  • Enabled DynamoDB auto‑scaling and added composite sort keys for efficient access patterns.

Fast action

  • Refactored the report job to use the paginated query above.
  • Turned on auto‑scaling for the table.
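
For reference, a sketch of what enabling that auto-scaling can look like with boto3 (table name, capacity bounds, and the 70 % target are illustrative):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/reports",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=25,
    MaxCapacity=1000,
)

# Track ~70 % read-capacity utilization
autoscaling.put_scaling_policy(
    PolicyName="reports-read-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/reports",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)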

9️⃣ Aggressive Health‑Check Causing False Positives

  • Overly aggressive health‑check intervals can cause healthy instances to be marked unhealthy.
  • Balance: enough frequency to catch real failures, but not so tight that transient spikes trigger false alarms.

🔟 General Lessons & Takeaways

  • 1 – JWT validation: Version compatibility matters; always support older clients during rollouts.
  • 2 – Lambda cold starts: Provisioned concurrency + warm‑up schedules mitigate latency spikes.
  • 3 – Route 53 health checks: Security‑group rules must allow the health‑checker IP ranges.
  • 4 – DB connection pools: Enforce proper cleanup; monitor pool usage well before saturation.
  • 5 – DynamoDB rate limits: Build exponential back‑off and retries in from day 1.
  • 6 – ALB health checks: Keep health‑check endpoints lightweight; separate deep checks from routing checks.
  • 7 – P99 latency: Cold starts dominate tail latency; provisioned concurrency is the antidote.
  • 8 – DynamoDB hot partitions: Prefer Query over Scan, paginate, and design for even key distribution.
  • 9 – Aggressive health checks: Too‑tight intervals trigger false alarms; tune thresholds and timeouts.
  • 10 – Overall: Observability first – instrument, alarm, and test assumptions before they become incidents.

By treating each incident as a learning opportunity and applying disciplined, observable fixes, you can turn chaotic production fire‑fighting into a predictable, resilient operation.

Bonus Incident 1 – 502 Errors During Blue‑Green Deployments

Symptom

  • 5 % of requests failed on every deployment with 502 Bad Gateway errors, despite using a blue‑green deployment strategy.

Initial Hypothesis

  • Instances were shutting down too quickly.

Root Cause

  • The connection‑draining timeout (deregistration delay) on the ALB was set to 30 seconds, while some API calls took up to 60 seconds.
  • The ALB terminated those connections mid‑request, resulting in the 502 errors.

Fix

# Increase the connection‑draining timeout (deregistration delay)
aws elbv2 modify-target-group-attributes \
    --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/name/xxxxxxxxxxxx \
    --attributes Key=deregistration_delay.timeout_seconds,Value=120

  • Added a deployment health‑check that verifies no in‑flight requests are being dropped.

Fast Action Taken

  1. Increased the deregistration delay.
  2. Implemented a graceful‑shutdown routine in the application (stop accepting new requests, finish existing ones; see the sketch after this list).
  3. Added a pre‑deployment validation step.
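
A minimal sketch of that graceful-shutdown routine (framework-agnostic, names illustrative): on SIGTERM the process stops accepting new work, then waits for in-flight requests to drain before exiting, staying under the 120-second deregistration delay set above.

import signal, sys, time

shutting_down = False
in_flight_requests = 0        # incremented/decremented by the request handler

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True      # start rejecting new requests / failing readiness checks

signal.signal(signal.SIGTERM, handle_sigterm)

def drain_and_exit(timeout_seconds=110):
    """Wait for in-flight requests to finish, then exit cleanly."""
    deadline = time.time() + timeout_seconds
    while in_flight_requests > 0 and time.time() < deadline:
        time.sleep(1)
    sys.exit(0)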

Lesson Learned

  • The connection‑draining timeout must be longer than the longest request latency.
  • Regularly monitor P99 latency and set the timeout accordingly.

Bonus Incident 2 – Deployment Script Left Security Groups Inconsistent

Symptom

  • Deployment script failed halfway, leaving security groups in an inconsistent state.
  • Could not SSH to instances and could not roll back the deployment.

What I thought

  • Manually fix the security groups.

What it actually was

  • The automation script had no rollback mechanism and changed production security groups without testing.

Fix

# Open a Session Manager session to the affected instance
aws ssm start-session --target i-1234567890abcdef0

  1. Describe current security groups (for reference).

  2. Make changes atomically – e.g., add the required rule:

    aws ec2 authorize-security-group-ingress \
        --group-id sg-12345 \
        --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges='[{CidrIp=0.0.0.0/0}]'

  3. Validate that the change worked.

  4. Remove the old rule only after successful validation.

Better Approach

  • Manage security groups with AWS CloudFormation (or other IaC tools) to ensure atomic, version‑controlled updates.

Fast Action Taken

  • Enabled Systems Manager Session Manager on all instances.
  • Switched security‑group management to CloudFormation.
  • Implemented a change‑approval workflow.

Lesson Learned

  • Never modify security groups manually in production; a single mistake can lock you out.
  • Use Infrastructure‑as‑Code and Session Manager as safety nets.

Tools That Make This Easier

When incidents happen, speed matters. I built an Incident Helper script that automates the repetitive parts of incident response:

  • Collects relevant CloudWatch logs.
  • Checks service health status.
  • Identifies common AWS misconfigurations.
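
For example, the log-collection step of such a helper might look roughly like this (log group name, filter pattern, and time window are illustrative):

import time
import boto3

logs = boto3.client("logs")

def collect_recent_errors(log_group="/aws/lambda/api-handler", minutes=15):
    start = int((time.time() - minutes * 60) * 1000)   # CloudWatch Logs uses epoch millis
    events = []
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(
        logGroupName=log_group,
        startTime=start,
        filterPattern="ERROR",
    ):
        events.extend(page["events"])
    return events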

The Real Lesson

  • Document every incident.
  • Build and maintain runbooks.
  • Regularly test fail‑over procedures.
  • Hold weekly post‑mortem discussions with the team.

The next incident is already scheduled; you just don’t know when.
Being prepared makes all the difference.
