10 AWS Production Incidents That Taught Me Real-World SRE
Source: Dev.to
1️⃣ 3 AM Wake‑up Call – 403 Errors
Symptom
- CloudWatch alarm: elevated 4XX errors.
- Traffic looked normal, but ≈ 30 % of requests returned 403.
What I thought
- API Gateway throttling or IAM permission issues.
What it actually was
- A code deployment changed JWT validation logic.
- Tokens from older mobile‑app versions (still used by ~30 % of users) were being rejected.
Fix
# Roll back the problematic deployment (the CLI has no "rollback" subcommand;
# stop the in-flight deployment with automatic rollback to redeploy the previous revision)
aws deploy list-deployments \
  --application-name my-app \
  --deployment-group-name prod \
  --include-only-statuses InProgress
aws deploy stop-deployment \
  --deployment-id d-XXXXXXXXX \
  --auto-rollback-enabled
# Add backward‑compatible token validation
# (code change – omitted for brevity)
Fast action
- Rolled back the deployment.
- Added compatibility for older tokens.
- Set up a CloudWatch metric to monitor app version distribution.
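For the version-distribution metric, a custom CloudWatch metric keyed on the client version is enough; here is a rough boto3 sketch (the namespace, metric, and dimension names are made up for illustration):
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_app_version(version: str):
    # Emit one data point per (sampled) request tagged with the client app version.
    # Graphing SampleCount by AppVersion shows how many users are still on old builds.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Clients",                 # hypothetical namespace
        MetricData=[{
            "MetricName": "RequestsByAppVersion",
            "Dimensions": [{"Name": "AppVersion", "Value": version}],
            "Value": 1,
            "Unit": "Count",
        }],
    )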
2️⃣ 5XX Spikes During Peak Traffic
Symptom
- 5XX errors spiking; load‑balancer health checks passed.
- ≈ 15 % of requests failed.
What I thought
- Backend service was overwhelmed.
What it actually was
- Lambda functions timed out due to cold starts during the traffic surge, returning 504 Gateway Timeout via API Gateway.
Fix
# Enable provisioned concurrency for the hot functions
# (must target a published version or alias, e.g. a "prod" alias – $LATEST is not supported)
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier prod \
  --provisioned-concurrent-executions 100
- Implemented exponential back‑off in API Gateway integrations.
Fast action
- Enabled provisioned concurrency for traffic‑sensitive Lambdas.
- Added CloudWatch alarms for concurrent executions approaching the account limit (example below).
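The concurrency alarm can be created like this; the threshold and SNS topic are placeholders (assuming the default 1,000 account concurrency limit):
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when account-level concurrent executions approach the limit
# (a threshold of 800 assumes the default 1,000-concurrency account limit).
cloudwatch.put_metric_alarm(
    AlarmName="lambda-concurrency-near-limit",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=800,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder topic
)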
3️⃣ Route 53 Failover to an Unready Secondary Region
Symptom
- At 2 PM, Route 53 failover routed all traffic to the secondary region, which quickly became overloaded.
What I thought
- Primary region was experiencing a failure.
What it actually was
- A security‑group change blocked the Route 53 health‑check endpoint.
- Service was healthy, but Route 53 could not verify it.
Fix
# Allow Route 53 health‑checker IP ranges
curl https://ip-ranges.amazonaws.com/ip-ranges.json |
  jq -r '.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix' |
  while read cidr; do
    aws ec2 authorize-security-group-ingress \
      --group-id sg-xxxxxx \
      --protocol tcp \
      --port 443 \
      --cidr "$cidr"
  done
# Quick health‑check test
curl -v https://api.example.com/health
Fast action
- Added the Route 53 health‑checker IP ranges to the security group.
- Implemented internal health checks that validate both endpoint accessibility and actual service health.
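For the internal checks, the idea is roughly this: endpoint reachability is what Route 53 already probes, so the internal check validates the dependencies the service actually needs (the table name and threshold here are illustrative):
import shutil
import boto3

def deep_health_check():
    # Goes beyond "is the port open": exercise real dependencies and return a
    # structured result for internal dashboards and alarms.
    checks = {}
    try:
        boto3.client("dynamodb").describe_table(TableName="sessions")  # hypothetical table
        checks["dynamodb"] = "ok"
    except Exception as exc:
        checks["dynamodb"] = f"fail: {exc}"
    checks["disk"] = "ok" if shutil.disk_usage("/").free > 1_000_000_000 else "low"
    return all(v == "ok" for v in checks.values()), checks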
4️⃣ “Connection Pool Exhausted” Errors – RDS
Symptom
- Application logs: “connection pool exhausted”.
- RDS metrics: CPU ≈ 20 %, connections well below max_connections.
What I thought
- Need to increase max_connections on the DB.
What it actually was
- The app wasn’t releasing connections after exceptions, leaving zombie connections in the pool.
Fix
# Example context manager to ensure proper cleanup
from contextlib import contextmanager

@contextmanager
def db_cursor(conn):
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()       # commit only on success
    except Exception:
        conn.rollback()     # roll back instead of committing on errors
        raise
    finally:
        cur.close()         # always release the cursor, even after exceptions
- Added connection‑timeout settings, circuit‑breaker logic, and a CloudWatch dashboard tracking pool health.
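If you happen to be on SQLAlchemy, the pool-level timeout settings look roughly like this (the DSN and numbers are illustrative; the point is to fail fast and recycle stale connections):
from sqlalchemy import create_engine

# Hypothetical DSN – the important part is the pool configuration.
engine = create_engine(
    "postgresql+psycopg2://app:secret@prod-db.example.internal:5432/app",
    pool_size=20,          # steady-state connections
    max_overflow=10,       # short bursts above pool_size
    pool_timeout=5,        # fail fast instead of queueing forever
    pool_recycle=1800,     # drop connections before server-side timeouts hit
    pool_pre_ping=True,    # detect and replace dead (zombie) connections
)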
Fast action
- Implemented the context manager above.
- Set alarms at 70 % pool utilization (instead of 95 %).
5️⃣ Lambda “Rate Exceeded” Errors During a Batch Job
Symptom
- Lambda functions failed with Rate exceeded while processing a batch job.
- The job halted completely.
What I thought
- We hit an AWS service limit.
What it actually was
- The batch job performed 10 000 concurrent DynamoDB writes with no back‑off, exhausting the table’s write capacity within seconds.
Fix
import time, random
from botocore.config import Config

def exponential_backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise                                     # out of retries – surface the error
            backoff = (2 ** attempt) + random.random()    # exponential delay plus jitter
            time.sleep(backoff)

# Use the built‑in retry config for the AWS SDK (boto3)
config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)
Fast action
- Wrapped DynamoDB writes with the retry helper.
- Enabled DynamoDB auto‑scaling for write capacity.
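Write-capacity auto-scaling goes through Application Auto Scaling; a boto3 sketch with a placeholder table name and limits:
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target (placeholder limits).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/batch-output",              # hypothetical table name
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=25,
    MaxCapacity=1000,
)

# Target-track 70 % write-capacity utilization.
autoscaling.put_scaling_policy(
    PolicyName="batch-output-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/batch-output",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)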
6️⃣ ALB Marking Healthy Instances Unhealthy
Symptom
- ALB sporadically marked instances unhealthy → 502 errors for some requests.
What I thought
- Instances were genuinely failing under load.
What it actually was
- Health‑check interval was 5 s with a 2 s timeout.
- Brief CPU spikes prevented the health‑check response, causing false‑negative health reports.
Fix
# Adjust target‑group health‑check settings
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 3
- Made the health‑check endpoint lightweight (no DB queries).
Best practice
| ✅ Do | ❌ Don’t |
|---|---|
| Health check that only verifies the process is alive (e.g., /ping). | Health check that performs expensive operations (e.g., a full DB query). |
Fast action
- Updated the health‑check configuration.
- Deployed a lightweight /ping endpoint.
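The /ping endpoint itself is deliberately boring – no database, no downstream calls. A sketch assuming a Flask app (adapt to whatever framework you run):
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/ping")
def ping():
    # Liveness only: the process is up and can serve a response.
    # Deep dependency checks live on a separate, internal-only endpoint.
    return jsonify(status="ok"), 200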
7️⃣ P99 Latency Spike to 8 s During Low Traffic
Symptom
- P99 latency surged to 8 s while P50 stayed at 200 ms during quiet periods.
What I thought
- Backend database performance degradation.
What it actually was
- Lambda cold starts. Functions were being terminated during idle periods, leading to long start‑up times for the next request.
Fix
# Enable provisioned concurrency for latency‑sensitive functions
# (must target a published version or alias, e.g. a "prod" alias – $LATEST is not supported)
aws lambda put-provisioned-concurrency-config \
  --function-name api-handler \
  --qualifier prod \
  --provisioned-concurrent-executions 200
# Keep functions warm with an EventBridge schedule (rate expression: every 5 minutes)
aws events put-rule \
  --name WarmLambdaRule \
  --schedule-expression "rate(5 minutes)"
- Reduced deployment package size by ≈60 % (removed unused libraries).
Fast action
- Applied provisioned concurrency to user‑facing APIs.
- Scheduled periodic “ping” invocations.
- Optimized the package size.
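One note on the warm-up schedule: put-rule only creates the schedule, so the pings start once the function is attached as a target and EventBridge is allowed to invoke it. A boto3 sketch with placeholder ARNs:
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:api-handler"  # placeholder
rule_arn = events.describe_rule(Name="WarmLambdaRule")["Arn"]

# Point the schedule at the function...
events.put_targets(
    Rule="WarmLambdaRule",
    Targets=[{"Id": "warm-api-handler", "Arn": function_arn}],
)

# ...and let EventBridge invoke it.
lambda_client.add_permission(
    FunctionName="api-handler",
    StatementId="allow-warmlambdarule",   # placeholder statement ID
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)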
8️⃣ DynamoDB ProvisionedThroughputExceededException During Report Generation
Symptom
- Writes succeeded, but reads failed with ProvisionedThroughputExceededException during daily report generation.
What I thought
- Need to increase read capacity units.
What it actually was
- The report used a Scan operation without pagination, creating a hot partition that consumed all read capacity in seconds.
Fix
def paginated_query(table, key_condition):
    items = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = table.query(
                KeyConditionExpression=key_condition,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = table.query(
                KeyConditionExpression=key_condition
            )
        items.extend(response["Items"])
        last_evaluated_key = response.get("LastEvaluatedKey")
        if not last_evaluated_key:
            break
    return items
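Called with a real key condition instead of a Scan (hypothetical user_id / created_at schema), usage looks like this:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reports")   # hypothetical table

daily_items = paginated_query(
    table,
    Key("user_id").eq("u-123") & Key("created_at").begins_with("2024-05-01"),
)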
- Switched to Query wherever possible.
- Implemented pagination and exponential back‑off.
- Enabled DynamoDB auto‑scaling and added composite sort keys for efficient access patterns.
Fast action
- Refactored the report job to use the paginated query above.
- Turned on auto‑scaling for the table.
9️⃣ Aggressive Health‑Check Causing False Positives
- Overly aggressive health‑check intervals can cause healthy instances to be marked unhealthy.
- Balance: enough frequency to catch real failures, but not so tight that transient spikes trigger false alarms. With the incident‑6 settings, for example, a 30 s interval and an unhealthy threshold of 3 still detect a genuinely dead instance within about 90 seconds, while a brief CPU spike no longer flips its status.
🔟 General Lessons & Takeaways
| Incident | Core Lesson |
|---|---|
| 1 – JWT validation | Version compatibility matters; always support older clients during rollouts. |
| 2 – Lambda cold starts | Provisioned concurrency + warm‑up schedules mitigate latency spikes. |
| 3 – Route 53 health checks | Security‑group rules must allow health‑checker IP ranges. |
| 4 – DB connection pools | Enforce proper cleanup; monitor pool usage well before saturation. |
| 5 – DynamoDB rate limits | Build exponential back‑off & retries from day 1. |
| 6 – ALB health checks | Keep health‑check endpoints lightweight; separate deep checks from routing checks. |
| 7 – P99 latency | Cold starts dominate tail latency; provisioned concurrency is the antidote. |
| 8 – DynamoDB hot partitions | Prefer Query over Scan, paginate, and design for even key distribution. |
| 9 – Aggressive health checks | Too‑tight intervals cause false alarms; tune thresholds and timeouts. |
| 10 – (Overall) | Observability first – instrument, alarm, and test assumptions before they become incidents. |
By treating each incident as a learning opportunity and applying disciplined, observable fixes, you can turn chaotic production fire‑fighting into a predictable, resilient operation.
Incident 1 – 502 Errors During Blue‑Green Deployments
Symptom
- 5 % of requests failed on every deployment with 502 Bad Gateway errors, despite using a blue‑green deployment strategy.
Initial Hypothesis
- Instances were shutting down too quickly.
Root Cause
- The connection‑draining timeout (deregistration delay) on the ALB was set to 30 seconds, while some API calls took up to 60 seconds.
- The ALB terminated those connections mid‑request, resulting in the 502 errors.
Fix
# Increase the connection‑draining timeout (deregistration delay)
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/name/xxxxxxxxxxxx \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120
- Added a deployment health‑check that verifies no in‑flight requests are being dropped.
Fast Action Taken
- Increased the deregistration delay.
- Implemented a graceful‑shutdown routine in the application (stop accepting new requests, finish existing ones) – sketched below.
- Added a pre‑deployment validation step.
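A minimal sketch of that graceful-shutdown routine, kept self-contained with the stdlib HTTP server (a real service would hook into its framework's lifecycle; the 120 s wait matches the new deregistration delay):
import signal
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

draining = False

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The health check flips to 503 while draining so the ALB takes us out of rotation.
        if self.path == "/health":
            self.send_response(503 if draining else 200)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = HTTPServer(("0.0.0.0", 8080), Handler)

def handle_sigterm(signum, frame):
    global draining
    draining = True                                   # stop advertising ourselves as healthy
    threading.Timer(120, server.shutdown).start()     # ≥ deregistration delay: let in-flight requests finish

signal.signal(signal.SIGTERM, handle_sigterm)
server.serve_forever()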
Lesson Learned
- The connection‑draining timeout must be longer than the longest request latency.
- Regularly monitor P99 latency and set the timeout accordingly.
Incident 2 – Deployment Script Left Security Groups Inconsistent
Symptom
- Deployment script failed halfway, leaving security groups in an inconsistent state.
- Could not SSH to instances and could not roll back the deployment.
What I thought
- Manually fix the security groups.
What it actually was
- The automation script had no rollback mechanism and changed production security groups without testing.
Fix
# Open a Session Manager session to the affected instance
aws ssm start-session --target i-1234567890abcdef0
- Describe the current security groups (for reference).
- Make changes atomically – e.g., add the required rule:
aws ec2 authorize-security-group-ingress \
  --group-id sg-12345 \
  --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges='[{CidrIp=0.0.0.0/0}]'
- Validate that the change worked.
- Remove the old rule only after successful validation.
Better Approach
- Manage security groups with AWS CloudFormation (or other IaC tools) to ensure atomic, version‑controlled updates.
Fast Action Taken
- Enabled Systems Manager Session Manager on all instances.
- Switched security‑group management to CloudFormation.
- Implemented a change‑approval workflow.
Lesson Learned
- Never modify security groups manually in production; a single mistake can lock you out.
- Use Infrastructure‑as‑Code and Session Manager as safety nets.
Tools That Make This Easier
When incidents happen, speed matters. I built an Incident Helper script that automates the repetitive parts of incident response:
- Collects relevant CloudWatch logs.
- Checks service health status.
- Identifies common AWS misconfigurations.
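The script itself isn't included here, but a stripped-down sketch of the log-collection piece gives the idea (log group, time window, and filter pattern are placeholders):
import time
import boto3

logs = boto3.client("logs")

def collect_recent_errors(log_group="/aws/lambda/api-handler", minutes=30):
    # Pull ERROR-level events from the last N minutes so they're in front of the
    # responder before anyone starts clicking around the console.
    now_ms = int(time.time() * 1000)
    events = logs.filter_log_events(
        logGroupName=log_group,
        startTime=now_ms - minutes * 60 * 1000,
        endTime=now_ms,
        filterPattern="ERROR",
    )
    return [e["message"] for e in events.get("events", [])]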
The Real Lesson
- Document every incident.
- Build and maintain runbooks.
- Regularly test fail‑over procedures.
- Hold weekly post‑mortem discussions with the team.
The next incident is already scheduled; you just don’t know when.
Being prepared makes all the difference.