Solved: DynamoDB errors in ap-southeast-2
Source: Dev.to
TL;DR
DynamoDB errors in ap-southeast-2, often showing up as ProvisionedThroughputExceededException or connection timeouts, are frequently caused by localized network "grey failures" within a specific Availability Zone, not by capacity issues. Solutions range from a quick instance restart to more robust fixes such as tuning AWS SDK client timeouts and adding a DynamoDB Gateway VPC Endpoint for private network connectivity.
Why It Happens
- An AWS region is a collection of Availability Zones (AZs).
- A "grey failure" in a single AZ can disrupt DynamoDB connectivity even when the overall region status is green.
- The AWS SDK resolves dynamodb.ap-southeast-2.amazonaws.com to an IP that is latency-optimized for the caller's AZ. If that specific front end experiences a transient network glitch, only instances in that AZ see failures.
Pro Tip: Never assume a region behaves as a single, monolithic service. Architect for failure within any individual AZ.
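A quick way to check this from one instance is to time a raw TCP connect to the regional endpoint and compare the results across AZs. Here is a minimal sketch using only the standard library. The `probe` helper, its 1-second default, and the injectable `connect` argument are illustrative assumptions, not part of any AWS tooling.

```python
import socket
import time

ENDPOINT = "dynamodb.ap-southeast-2.amazonaws.com"


def probe(host=ENDPOINT, port=443, timeout=1.0, connect=socket.create_connection):
    """Attempt one TCP connect and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False, time.monotonic() - start
    conn.close()
    return True, time.monotonic() - start

# Usage on an instance (needs network access):
#   ok, seconds = probe()
#   print(ok, round(seconds, 3))
```

Running the same probe on instances in 2a, 2b, and 2c should make an AZ-local problem stand out quickly.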
The Incident (A Real-World Story)
"2:47 AM. PagerDuty screaming. Our primary auth service in Sydney (ap-southeast-2) was throwing ProvisionedThroughputExceededException and connection timeouts to DynamoDB. CloudWatch metrics for prod-users-table were flat, with no capacity exhaustion. Half of our login attempts were failing."
After an hour of debugging we discovered:
- Only instances in AZ ap-southeast-2a were failing.
- Instances in 2b and 2c were healthy.
This is the classic signature of an AWS "grey failure": a localized, often network-related hiccup that doesn't turn the AWS Status page red.
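Spotting that signature is easier if you group errors by the AZ of the calling instance. A minimal sketch, assuming you can export (AZ, success) pairs from your logs. The sample data and the 50% threshold are made up for illustration.

```python
from collections import defaultdict


def failure_rate_by_az(events):
    """Map each AZ to its failure rate, given (az, succeeded) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for az, succeeded in events:
        totals[az] += 1
        if not succeeded:
            failures[az] += 1
    return {az: failures[az] / totals[az] for az in totals}


# Hypothetical sample: (AZ of the calling instance, did the DynamoDB call succeed?)
sample = [
    ("ap-southeast-2a", False), ("ap-southeast-2a", False), ("ap-southeast-2a", True),
    ("ap-southeast-2b", True), ("ap-southeast-2b", True),
    ("ap-southeast-2c", True), ("ap-southeast-2c", True),
]

rates = failure_rate_by_az(sample)
suspects = [az for az, rate in rates.items() if rate > 0.5]
print(suspects)
```

If one AZ dominates the failures while the table's capacity metrics are flat, suspect the network before the table.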
Three Playbooks - From Quick Fix to Long-Term Remedy
| # | Play | When to Use | What It Does |
|---|---|---|---|
| 1 | Restart the failing EC2 instance | Emergency, need to restore service in minutes | Resets connections and cached DNS. A full stop/start (not just a reboot) also moves the instance to different underlying hardware and, without an Elastic IP, gives it a new public IP, which can route around the faulty network path. |
| 2 | Tune AWS SDK timeouts & retry strategy | You want a sustainable, low-effort fix that reduces blast radius | Makes the client fail fast, retry aggressively, and avoid long hangs on a bad connection. |
| 3 | Deploy a DynamoDB Gateway VPC Endpoint | Building a resilient, secure architecture for the long term | Creates a private route between your VPC and DynamoDB, bypassing the public internet and eliminating many network-related failures. |
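Play 1 can be scripted for the middle of an incident. A minimal sketch, assuming a boto3 EC2 client and a full stop/start rather than an OS-level reboot (a reboot keeps the instance on the same host). The helper name and the instance ID in the usage comment are placeholders.

```python
def stop_start_instance(ec2_client, instance_id):
    """Stop, then start, one instance, waiting for each state change.

    Assumes an EBS-backed instance; instance-store data is lost on stop.
    """
    ec2_client.stop_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2_client.start_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="ap-southeast-2")
#   stop_start_instance(ec2, "i-0123456789abcdef0")
```

Treat this as a stopgap: the instance comes back in the same AZ, so the underlying problem may return.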
Play #2 - Example: Aggressive SDK Configuration (Python/Boto3)
    # Example in Python using Boto3
    from botocore.config import Config
    from boto3 import resource

    # Configure a more aggressive timeout and retry strategy:
    #   - Connect timeout: 1 s
    #   - Read timeout: 1 s
    #   - Retries: up to 5 retries after the initial request, with backoff
    config = Config(
        connect_timeout=1,
        read_timeout=1,
        retries={'max_attempts': 5}
    )

    # Pass this config when creating your client or resource
    dynamodb = resource('dynamodb',
                        region_name='ap-southeast-2',
                        config=config)
    table = dynamodb.Table('prod-users-table')
    # All calls using `table` now inherit the new timeouts.
This change can turn a 30-second user-visible outage into a fast-fail-and-retry scenario that most users never notice.
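To see what "fail fast and retry" means in practice, here is a minimal sketch of capped exponential backoff with jitter. The SDK already retries internally, so this is only an illustration of the idea and not botocore's actual implementation. The function name, defaults, and retryable exception types are assumptions.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.05, max_delay=1.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on a retryable error, back off and try again.

    The delay before retry N is a random fraction of
    min(max_delay, base_delay * 2 ** (N - 1)).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

Short timeouts keep each attempt cheap, and the jitter stops many clients from retrying in lockstep against an endpoint that is already struggling.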
Play #3 - Architecting the Problem Out of Existence
DynamoDB Gateway VPC Endpoint
- Private, direct connection between your VPC and DynamoDB.
- Traffic stays on the AWS private network and never touches the public internet.
- Improves reliability, reduces latency, and adds a security boundary (no need for NAT/IGW egress).
Implementation steps (high-level):
- Open the VPC console → Endpoints → Create Endpoint.
- Choose Service category: AWS services and select com.amazonaws.ap-southeast-2.dynamodb (type: Gateway).
- Select your VPC and associate the endpoint with the route tables used by your application subnets.
- (Optional) Add a policy to restrict which DynamoDB tables can be accessed.
- No SDK change is normally needed: once the route tables contain the endpoint's prefix-list route, traffic to the regional DynamoDB hostname goes through the endpoint automatically.
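The console steps above can also be scripted. A minimal sketch that builds the arguments for the EC2 `create_vpc_endpoint` call, assuming boto3. The helper name is hypothetical, and the VPC ID, route table ID, and table ARN in the usage comment are placeholders.

```python
import json

SERVICE_NAME = "com.amazonaws.ap-southeast-2.dynamodb"


def gateway_endpoint_request(vpc_id, route_table_ids, table_arns=None):
    """Build keyword arguments for EC2 create_vpc_endpoint (Gateway type)."""
    request = {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": SERVICE_NAME,
        "RouteTableIds": list(route_table_ids),
    }
    if table_arns:
        # Optional endpoint policy: only the listed tables are reachable.
        policy = {
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": "*",
                "Action": "dynamodb:*",
                "Resource": list(table_arns),
            }],
        }
        request["PolicyDocument"] = json.dumps(policy)
    return request

# Usage (requires boto3 and AWS credentials; all IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="ap-southeast-2")
#   ec2.create_vpc_endpoint(**gateway_endpoint_request(
#       "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"]))
```

If you restrict the policy to specific tables, remember that queries against secondary indexes use separate index ARNs.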
Bottom Line
- Grey failures in a single AZ can masquerade as capacity problems.
- Quick fix: Restart the affected instance.
- Short-term resilience: Tune SDK timeouts and retries.
- Long-term robustness: Deploy a DynamoDB Gateway VPC Endpoint.
By layering these approaches, you can keep your authentication service (or any DynamoDB-backed workload) humming even when a single AZ hiccups.
VPC Endpoint for DynamoDB
A gateway endpoint routes DynamoDB traffic over the AWS network instead of through an internet gateway or NAT device, avoiding the unpredictable network paths that cause "grey failures." Your traffic never leaves the AWS network, making it both more reliable and more secure.
How to set it up
- Create a Gateway Endpoint in your VPC.
- Associate the endpoint with the route tables of the subnets that host your application instances.
- Update Security Groups to allow traffic to the DynamoDB service via the endpoint's prefix list.
It's a bit more work, but it virtually eliminates this class of problems while keeping database traffic off the Internet.
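The security group step only matters if your outbound rules are restricted, since the default rule allows all egress. A minimal sketch, assuming a boto3 EC2 client: it looks up the AWS-managed DynamoDB prefix list and adds an HTTPS egress rule for it. The helper name and the security group ID in the usage comment are placeholders.

```python
def allow_dynamodb_egress(ec2_client, security_group_id, region="ap-southeast-2"):
    """Allow outbound HTTPS from a security group to the DynamoDB prefix list."""
    response = ec2_client.describe_prefix_lists(
        Filters=[{
            "Name": "prefix-list-name",
            "Values": [f"com.amazonaws.{region}.dynamodb"],
        }]
    )
    prefix_list_id = response["PrefixLists"][0]["PrefixListId"]
    ec2_client.authorize_security_group_egress(
        GroupId=security_group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "PrefixListIds": [{"PrefixListId": prefix_list_id}],
        }],
    )
    return prefix_list_id

# Usage (requires boto3 and AWS credentials; the group ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="ap-southeast-2")
#   allow_dynamodb_egress(ec2, "sg-0123456789abcdef0")
```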
Solution Options
| # | Solution | Effort | Effectiveness | When to Use |
|---|---|---|---|---|
| 1 | Reboot Instance | Very Low | Low (Temporary fix) | During an active incident to restore a single node |
| 2 | Tune SDK Client | Low | High (Handles most cases) | Should be standard practice in all production applications |
| 3 | VPC Endpoint | Medium | Very High (Architectural fix) | For critical production workloads where reliability and security are paramount |
TL;DR
When you encounter a weird, region-specific DynamoDB error:
- Don't immediately blame your code or capacity planning.
- Check which AZs are failing.
- Consider a VPC endpoint if the issue is recurring or impacts production reliability.
Remember: the cloud is just someone else's computer, and sometimes the network cable between those computers gets a little loose.
Read the original article on TechResolve.blog