Runbooks Don't Investigate. AWS DevOps Agent Does.
Source: Dev.to
Overview
I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post‑mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not covered. That gap is what this series is about.
In the BuildWithAI: DR Toolkit on AWS series, I showed how to build six AI-powered tools that automate the tedious parts of DR planning, all running on serverless AWS in ap-southeast-1. Those tools handle what you do before an incident and what you do after. None of them touch the part in between: the actual incident response.
This series covers that middle phase using AWS DevOps Agent. The demo app is PayLedger, a multi‑region serverless payment ledger built specifically for this blog. It is not a real product and contains no real user data.
- Part 1 maps out the gap, introduces DevOps Agent, and walks through the architecture.
- Part 2 covers the full setup and the actual demo, including what the agent’s investigation looked like when I ran three real faults against it.
The DR Lifecycle, Mapped Out
| Phase | What happens | Covered by |
|---|---|---|
| Prepare | Runbooks, RTO/RPO targets, DR strategy, checklists | DR Toolkit |
| Detect | Alarm fires, SNS notifies DevOps Agent, health‑check fails, DNS fails over | CloudWatch + Route 53 + SNS |
| Investigate | Root cause analysis, cross‑region signal correlation | AWS DevOps Agent |
| Recover | Apply fix, bring the unhealthy region back up, validate failback | Human + runbook |
| Learn | Prevention recommendations, operational improvements | DevOps Agent |
The DR Toolkit is solid for Prepare. CloudWatch and Route 53 handle Detect—alarms fire and Route 53 failover routes traffic to the healthy region automatically. But Investigate is the phase with no real tooling unless someone builds it themselves. Figuring out why a service running in the primary region is down, correlating signals across services, and giving the team the information needed to bring that region back up—that is what AWS DevOps Agent targets.
AWS DevOps Agent
AWS DevOps Agent is a frontier agent for cloud operations. “Frontier agent” is AWS’s term for autonomous systems that:
- Work independently,
- Scale across concurrent tasks, and
- Run persistently without constant human oversight.
The agent starts working the moment an alarm fires—no manual trigger needed.
Three Core Capabilities
- Autonomous incident response
  - When an alert arrives, the agent begins investigating immediately.
  - It correlates signals across services and regions.
  - If multiple alarms stem from the same root cause, it groups them together.
  - Root-cause categories it investigates:
    - System changes
    - Input anomalies
    - Resource limits
    - Component failures
    - Dependency issues
- Proactive incident prevention
  - After an investigation, the agent recommends improvements in four areas:
    - Observability
    - Infrastructure optimization
    - Deployment pipeline
    - Application resilience
- On-demand SRE tasks
  - Conversational chat against your actual infrastructure.
  - Ask about resource state, alarm status, deployment history, etc., without switching consoles.
Architecture
The service uses a dual‑console architecture:
| Console | Purpose |
|---|---|
| AWS Console | Admin setup (Agent Space creation, integrations). |
| Agent Space web app | Day‑to‑day work (investigations, topology, prevention, chat). |
More on features:
- [AWS DevOps Agent features]
- [About AWS DevOps Agent]
Availability
As of this writing, AWS DevOps Agent is not available in ap‑southeast‑1 (Singapore) at GA. Supported regions are:
- us-east-1
- us-west-2
- eu-central-1
- eu-west-1
- ap-southeast-2
- ap-northeast-1
AWS may add more regions in the future, so check the Supported Regions page before you start.
- The two closest for SEA builders are `ap-southeast-2` (Sydney) and `ap-northeast-1` (Tokyo).
- For this demo I used `ap-southeast-2`, but any supported region works.
- The Agent Space and its investigation data live in the chosen region; your workload stays wherever it is.
- Cross‑region monitoring means the agent discovers and monitors resources across any linked AWS account regardless of region.
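As a small guard before setup, the region list above can be encoded and checked in a deploy script. This is a minimal sketch: the function name `check_agent_region` is hypothetical, and the set only reflects the regions listed in this post, so it needs updating when AWS adds more.

```python
# Regions listed in this post as supporting AWS DevOps Agent at the time of writing.
SUPPORTED_AGENT_REGIONS = frozenset({
    "us-east-1", "us-west-2", "eu-central-1", "eu-west-1",
    "ap-southeast-2", "ap-northeast-1",
})


def check_agent_region(agent_region: str) -> str:
    """Return the region if it is in the supported list, otherwise raise."""
    if agent_region not in SUPPORTED_AGENT_REGIONS:
        supported = ", ".join(sorted(SUPPORTED_AGENT_REGIONS))
        raise ValueError(
            f"{agent_region} is not in the supported list: {supported}"
        )
    return agent_region


# The workload region (ap-southeast-1) is not validated here: only the
# Agent Space has to live in a supported region.
agent_region = check_agent_region("ap-southeast-2")
```

Calling `check_agent_region("ap-southeast-1")` raises a `ValueError`, which matches the situation described above: the workload can stay in Singapore, but the Agent Space cannot be created there.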
Note: PayLedger is a demo project built solely for this blog series. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information. All data is synthetic and generated by a seed script.
PayLedger Demo Application
A payment ledger is a practical choice for a DR demo because the requirements are clear: any outage means transactions fail and balances go stale. The multi‑region setup is the right response, not over‑engineering.
Endpoints
| Endpoint | Description |
|---|---|
| `POST /transaction` | Record a transaction |
| `GET /transactions` | List recent transactions |
| `GET /balance` | Get the current balance |
| `GET /health` | Health check |
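To make the `/health` endpoint concrete, here is a minimal sketch of what such a Lambda handler could look like. The response body shape (`status`, `region`) is an assumption for illustration, not PayLedger's actual contract; `AWS_REGION` is the environment variable the Lambda runtime sets automatically.

```python
import json
import os


def health_handler(event, context):
    """Return 200 plus the serving region so a client can tell who answered."""
    # AWS_REGION is provided by the Lambda runtime; the fallback is for local runs.
    region = os.environ.get("AWS_REGION", "unknown")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # Hypothetical body shape: the real demo may return different fields.
        "body": json.dumps({"status": "ok", "region": region}),
    }
```

Returning the region in the body is one simple way for a frontend to know which side of the failover served the request.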
Architecture Diagram
payledger.yourdomain.com (CloudFront + S3)
|
Next.js UI
(balance, transactions, region indicator)
| calls
v
api-payledger.yourdomain.com
|
Route 53 (failover routing)
|-- PRIMARY --> ap-southeast-1 (Singapore)
+-- SECONDARY --> ap-northeast-1 (Tokyo)
ap-southeast-1 ap-northeast-1
+-- API Gateway +-- API Gateway
+-- Lambda: createTransaction +-- Lambda: createTransaction
+-- DynamoDB Global Table (replicated) +-- DynamoDB Global Table (replicated)
- Route 53 performs active‑passive failover between the two regions.
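- DynamoDB Global Tables replicate ledger data across regions, so Tokyo can serve the same balances and transactions after a failover. Replication is asynchronous by default, so a write made in the last moments before a failure may not have reached the replica yet.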
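The active-passive setup can be sketched as the payloads you would hand to Route 53. This is an illustration under assumptions, not the project's actual infrastructure code: it uses CNAME records with a 60 s TTL and the per-region hostnames from the repository diagram later in this post, the health-check IDs are placeholders, and the `SetIdentifier` labels are made up. The dictionaries follow the shape boto3's `route53` client expects for `create_health_check` and `change_resource_record_sets`.

```python
def health_check_config(regional_host: str) -> dict:
    """HealthCheckConfig for an HTTPS check against a regional /health endpoint."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": regional_host,
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 10,   # seconds between checks
        "FailureThreshold": 2,   # consecutive failures before "unhealthy"
    }


def failover_record(role: str, target: str, health_check_id: str) -> dict:
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api-payledger.yourdomain.com",
            "Type": "CNAME",
            "SetIdentifier": f"payledger-{role.lower()}",  # hypothetical label
            "Failover": role,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


# Placeholder health-check IDs: the real ones come from create_health_check.
change_batch = {
    "Changes": [
        failover_record("PRIMARY", "apse1-api-payledger.yourdomain.com", "hc-primary-id"),
        failover_record("SECONDARY", "apne1-api-payledger.yourdomain.com", "hc-secondary-id"),
    ]
}
```

`change_batch` would be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with your hosted zone ID. An alias record pointing at the API Gateway custom domain is another common choice; the CNAME form is used here only because the diagram mentions a TTL.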
Architecture Overview
ap-southeast-1 (Singapore)                ap-northeast-1 (Tokyo)
+-- API Gateway                           +-- API Gateway
+-- Lambda: createTransaction             +-- Lambda: createTransaction
+-- Lambda: listTransactions              +-- Lambda: listTransactions
+-- Lambda: getBalance                    +-- Lambda: getBalance
+-- Lambda: health                        +-- Lambda: health
+-- Lambda: devopsAgentTrigger            +-- Lambda: devopsAgentTrigger
+-- DynamoDB                              +-- DynamoDB (replica)
+-- SNS Topic (alarm notifications)       +-- SNS Topic (alarm notifications)
+-- CloudWatch alarms                     +-- CloudWatch alarms

ap-southeast-2 (Sydney)
+-- AWS DevOps Agent
+-- Agent Space
+-- Slack (optional)
+-- GitHub (optional)
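The diagram shows CloudWatch alarms publishing to an SNS topic with a `devopsAgentTrigger` Lambda behind it. The sketch below shows only the first half of what such a function might do: unpack the alarm notification that SNS delivers. The function name `summarize_alarm` is hypothetical, and how the real trigger hands the alarm to DevOps Agent is not shown here. The field names (`AlarmName`, `NewStateValue`, `NewStateReason`, `AlarmArn`) are the ones CloudWatch puts in its alarm notification JSON.

```python
import json


def summarize_alarm(event: dict) -> list[dict]:
    """Extract alarm name, region code and reason from an SNS-delivered event."""
    summaries = []
    for record in event.get("Records", []):
        # SNS wraps the CloudWatch payload as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # skip OK / INSUFFICIENT_DATA transitions
        # ARN layout: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
        parts = message.get("AlarmArn", "").split(":")
        summaries.append({
            "alarm": message["AlarmName"],
            "region": parts[3] if len(parts) > 3 else "unknown",
            "reason": message.get("NewStateReason", ""),
        })
    return summaries
```

The region code is taken from the ARN because the notification's own `Region` field holds a display name rather than a code such as `ap-southeast-1`.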
Service Layer Summary
| Layer | Service | Notes |
|---|---|---|
| Frontend | Next.js (static) + S3 + CloudFront | payledger.yourdomain.com |
| DNS | Route 53 | Fail‑over routing + health checks |
| Compute | Lambda (Python 3.12) | 5 functions per region |
| API | API Gateway (HTTP API, regional) | Custom domain per region |
| Database | DynamoDB Global Tables | Multi‑region replication |
| Observability | CloudWatch | Alarms in both regions |
Fail‑over Behaviour
- Health check – Route 53 calls `/health` every 10 s.
- Fail-over trigger – If the check fails twice (≈ 20 s), DNS switches traffic to the secondary region.
- Frontend indicator – The UI polls `/health` every 5 s and shows a coloured badge:
  - Green – Singapore (PRIMARY)
  - Amber – Tokyo (FAILOVER)
- Data replication – DynamoDB Global Tables keep balance and transaction history in sync across regions. After a fail-over the secondary serves the same replicated data; only the serving region changes.
- Fault injection scenario – When faults are injected into ap-southeast-1 (Singapore), the health check starts failing and Route 53 reroutes traffic to ap-northeast-1 (Tokyo) within ~20 s. Users continue to be served from Tokyo while the DevOps Agent investigates. Once the root cause is fixed, the primary region recovers and Route 53 fails back.
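The ~20 s figure is simple arithmetic on the health-check settings, and the DNS TTL adds to what a client can actually experience. The sketch below works it through using the 10 s interval, the two-failure threshold, and the 60 s TTL from the repository diagram later in this post. It is a rough model: it ignores health-checker scheduling and propagation inside Route 53, and the function name is hypothetical.

```python
def failover_window(interval_s: int, failure_threshold: int, ttl_s: int) -> tuple[int, int]:
    """Return (detection time, worst case for a client holding a cached answer)."""
    # Route 53 needs `failure_threshold` consecutive failed checks to trip.
    detection = interval_s * failure_threshold
    # A resolver that cached the primary just before the flip keeps it for one TTL.
    worst_case_client = detection + ttl_s
    return detection, worst_case_client


detection, worst_case = failover_window(interval_s=10, failure_threshold=2, ttl_s=60)
print(f"health check trips after ~{detection} s; "
      f"cached resolvers may lag up to ~{worst_case} s")
```

Route 53 starts answering with the Tokyo endpoint after roughly 20 s, but a client or resolver that cached the Singapore answer may keep using it until the TTL expires.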
Fault‑Injection Matrix
| # | Fault | How it breaks | Root‑cause category |
|---|---|---|---|
| 1 | IAM permission denied | Role swapped to a fault role with no DynamoDB access | System change |
| 2 | Lambda throttling | Reserved concurrency set to 0 → 429 responses before function runs | Resource limits |
| 3 | Missing environment variable | TABLE_NAME removed → KeyError at module load | Code / config change |
All three faults are triggered simultaneously with `python scripts/fault.py inject`. The default mode assigns one distinct fault per service.
- One alarm fires in ap‑southeast‑1.
- Three different root causes appear in the investigation.
- The DevOps Agent must untangle all three in a single run – a tougher test than handling each fault individually.
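For readers who want a feel for what injecting these faults involves, here is a sketch of the three changes expressed as Lambda API calls. It is not the repository's `scripts/fault.py`. The function name `inject_faults`, the `functions` mapping, and the choice of which function receives which fault are all assumptions. The client methods (`update_function_configuration`, `put_function_concurrency`, `get_function_configuration`) are the standard boto3 Lambda client calls, and the client is passed in so the logic can be exercised without touching AWS.

```python
def inject_faults(lambda_client, functions: dict, fault_role_arn: str) -> None:
    """Apply one distinct fault per function, mirroring the matrix above.

    `functions` maps a fault key ("iam", "throttle", "env") to a Lambda
    function name. The mapping is hypothetical.
    """
    # Fault 1: swap the execution role for one without DynamoDB access.
    lambda_client.update_function_configuration(
        FunctionName=functions["iam"], Role=fault_role_arn
    )

    # Fault 2: reserved concurrency of 0 throttles every invocation.
    lambda_client.put_function_concurrency(
        FunctionName=functions["throttle"], ReservedConcurrentExecutions=0
    )

    # Fault 3: rewrite the environment without TABLE_NAME, keeping the rest.
    current = lambda_client.get_function_configuration(
        FunctionName=functions["env"]
    )
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables.pop("TABLE_NAME", None)
    lambda_client.update_function_configuration(
        FunctionName=functions["env"], Environment={"Variables": variables}
    )
```

With a real client this would be called as `inject_faults(boto3.client("lambda", region_name="ap-southeast-1"), {...}, fault_role_arn)`. A matching restore step would need to record the original role, concurrency setting, and environment before changing them.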
Disaster‑Recovery (DR) Toolkit Context
- The Prepare phase is covered by the DR Toolkit.
- This series focuses on Investigate and Recover – the steps that happen after an alarm fires.
The AWS DevOps Agent does not need the DR Toolkit to investigate. It:
- Reads the topology.
- Correlates signals across services.
- Identifies root causes.
- Posts findings to Slack automatically.
Optionally, you can preload a runbook generated by the DR Toolkit as a Custom Skill to give the agent extra architectural knowledge.
Part 2 – Hands‑On Demo
In the next part we will:
- Deploy PayLedger to both regions.
- Configure Route 53 fail‑over.
- Set up the Agent Space (Slack, GitHub, etc.).
- Run the three simultaneous faults.
We’ll walk through the agent’s timeline, findings, root‑cause identification, and mitigation recommendations.
Get Started / Fork It
- Repository:
- Project name: payledger-aws-devops-agent
PayLedger – a multi‑region serverless payment ledger (demo only).
Deployed across ap‑southeast‑1 (Singapore, primary) and ap‑northeast‑1 (Tokyo, secondary) using Lambda, DynamoDB Global Tables, and Route 53 fail‑over routing.
Note: This is a demonstration project. It does not process real transactions and contains no PII.
Architecture Diagram (high‑level)
payledger.yourdomain.com (CloudFront + S3)
│
Next.js static UI (balance, transactions, region indicator)
│
▼
api-payledger.yourdomain.com
│
Route 53 fail‑over routing
├── PRIMARY ──▶ apse1-api-payledger.yourdomain.com ← health check
└── SECONDARY ──▶ apne1-api-payledger.yourdomain.com ← health check
TTL: 60 s | health check: 10 s interval, 2 failures to trip
│
┌────────┴─────────┐
│ │
ap‑southeast‑1 ap‑northeast‑1
(Singapore) (Tokyo)
│ │
│ API Gateway │ API Gateway
│ (regional) │ (regional)
│ Lambda: createTransaction
│ Lambda: listTransactions
│ … (other functions)
│
└─ DynamoDB Global Table (replicated)
References
- AWS DevOps Agent – features, supported regions, and overview.
- Amazon DynamoDB Global Tables – multi‑region replication.
- Amazon Route 53 – fail‑over routing configuration.
- Disaster Recovery of Workloads on AWS – best‑practice guide.