Runbooks Don't Investigate. AWS DevOps Agent Does.

Published: May 3, 2026 at 09:14 AM EDT

9 min read

Source: Dev.to

Overview

I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post‑mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not covered. That gap is what this series is about.

In the BuildWithAI: DR Toolkit on AWS series, I ran through how you can build six AI‑powered tools that automate the tedious parts of DR planning, all running on serverless AWS in ap‑southeast‑1. Those tools handle what you do before an incident and what you do after. But the part in between—the actual incident response—none of them touch.

This series covers that middle phase using AWS DevOps Agent. The demo app is PayLedger, a multi‑region serverless payment ledger built specifically for this blog. It is not a real product and contains no real user data.

Part 1 maps out the gap, introduces DevOps Agent, and walks through the architecture.
Part 2 covers the full setup and the actual demo, including what the agent’s investigation looked like when I ran three real faults against it.

The DR Lifecycle, Mapped Out

Phase	What happens	Covered by
Prepare	Runbooks, RTO/RPO targets, DR strategy, checklists	DR Toolkit
Detect	Alarm fires, SNS notifies DevOps Agent, health‑check fails, DNS fails over	CloudWatch + Route 53 + SNS
Investigate	Root cause analysis, cross‑region signal correlation	AWS DevOps Agent
Recover	Apply fix, bring the unhealthy region back up, validate failback	Human + runbook
Learn	Prevention recommendations, operational improvements	DevOps Agent

The DR Toolkit is solid for Prepare. CloudWatch and Route 53 handle Detect—alarms fire and Route 53 failover routes traffic to the healthy region automatically. But Investigate is the phase with no real tooling unless someone builds it themselves. Figuring out why a service running in the primary region is down, correlating signals across services, and giving the team the information needed to bring that region back up—that is what AWS DevOps Agent targets.

AWS DevOps Agent

AWS DevOps Agent is a frontier agent for cloud operations. “Frontier agent” is AWS’s term for autonomous systems that:

- Work independently,
- Scale across concurrent tasks, and
- Run persistently without constant human oversight.

The agent starts working the moment an alarm fires—no manual trigger needed.

Three Core Capabilities

Autonomous incident response
- When an alert arrives, the agent begins investigating immediately.
- It correlates signals across services and regions.
- If multiple alarms stem from the same root cause, it groups them together.
- Root‑cause categories it investigates:
  - System changes
  - Input anomalies
  - Resource limits
  - Component failures
  - Dependency issues

Proactive incident prevention
- After an investigation, the agent recommends improvements in four areas:
  - Observability
  - Infrastructure optimization
  - Deployment pipeline
  - Application resilience

On‑demand SRE tasks
- Conversational chat against your actual infrastructure.
- Ask about resource state, alarm status, deployment history, etc., without switching consoles.

Architecture

The service uses a dual‑console architecture:

Console	Purpose
AWS Console	Admin setup (Agent Space creation, integrations).
Agent Space web app	Day‑to‑day work (investigations, topology, prevention, chat).

More on features:

[AWS DevOps Agent features]
[About AWS DevOps Agent]

Availability

As of this writing, AWS DevOps Agent is not available in ap‑southeast‑1 (Singapore) at GA. Supported regions are:

us-east-1
us-west-2
eu-central-1
eu-west-1
ap-southeast-2
ap-northeast-1

AWS may add more regions in the future, so check the Supported Regions page before you start.

The two closest for SEA builders are ap‑southeast‑2 (Sydney) and ap‑northeast‑1 (Tokyo).
For this demo I used ap‑southeast‑2, but any supported region works.
The Agent Space and its investigation data live in the chosen region; your workload stays wherever it is.
Cross‑region monitoring means the agent discovers and monitors resources across any linked AWS account regardless of region.

Note: PayLedger is a demo project built solely for this blog series. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information. All data is synthetic and generated by a seed script.

PayLedger Demo Application

A payment ledger is a practical choice for a DR demo because the requirements are clear: any outage means transactions fail and balances go stale. The multi‑region setup is the right response, not over‑engineering.

Endpoints

Endpoint	Description
`POST /transaction`	Record a transaction
`GET /transactions`	List recent transactions
`GET /balance`	Get the current balance
`GET /health`	Health check
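
The `/health` endpoint is what both Route 53 and the UI poll, so it helps to see how small it can be. The following is a minimal sketch, not the repository's actual handler: the response fields (`status`, `region`) are assumptions chosen so the UI could tell which region answered.

```python
import json
import os


def handler(event, context):
    """Hypothetical /health handler: report which region served the request."""
    # AWS_REGION is set by the Lambda runtime; the fallback only matters for local runs.
    region = os.environ.get("AWS_REGION", "unknown")
    body = {"status": "ok", "region": region}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

Because the handler does nothing except return, a failing health check would point at the Lambda or its configuration rather than at a downstream dependency. A deeper check that also reads from DynamoDB is a reasonable alternative, at the cost of more ways to fail.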

Architecture Diagram

payledger.yourdomain.com  (CloudFront + S3)
        |
   Next.js UI
   (balance, transactions, region indicator)
        | calls
        v
api-payledger.yourdomain.com
        |
   Route 53 (failover routing)
        |-- PRIMARY   --> ap-southeast-1 (Singapore)
        +-- SECONDARY --> ap-northeast-1 (Tokyo)

ap-southeast-1                          ap-northeast-1
+-- API Gateway                         +-- API Gateway
+-- Lambda: createTransaction           +-- Lambda: createTransaction
+-- DynamoDB Global Table (replicated)  +-- DynamoDB Global Table (replicated)

Route 53 performs active‑passive failover between the two regions.
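DynamoDB Global Tables replicate ledger data to both regions, so either region can serve reads and writes. Replication is asynchronous by default, which makes the two replicas eventually consistent rather than instantly identical.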
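
To make the active‑passive setup concrete, here is a sketch of the pair of failover records expressed as the `ChangeBatch` structure that Route 53's `change_resource_record_sets` API accepts. It assumes CNAME records pointing at per‑region hostnames and a 60 s TTL; the health check IDs and the `payledger-` set identifiers are hypothetical, and the demo may well use alias records instead.

```python
def failover_change_batch(api_name, primary_target, secondary_target,
                          primary_check_id, secondary_check_id, ttl=60):
    """Build a ChangeBatch dict describing a PRIMARY/SECONDARY failover pair."""

    def record(role, target, check_id):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": api_name,
                "Type": "CNAME",
                # Records sharing a name must have unique set identifiers.
                "SetIdentifier": f"payledger-{role.lower()}",
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
                # Route 53 stops returning a record whose health check is failing.
                "HealthCheckId": check_id,
            },
        }

    return {
        "Changes": [
            record("PRIMARY", primary_target, primary_check_id),
            record("SECONDARY", secondary_target, secondary_check_id),
        ]
    }
```

The function only builds the dictionary. Applying it would mean passing the result as `ChangeBatch` to `change_resource_record_sets` on a boto3 Route 53 client, together with your own hosted zone ID.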

Architecture Overview

ap-southeast-1                         ap-northeast-1
+-- Lambda: listTransactions           +-- Lambda: listTransactions
+-- Lambda: getBalance                 +-- Lambda: getBalance
+-- Lambda: health                     +-- Lambda: health
+-- Lambda: devopsAgentTrigger         +-- Lambda: devopsAgentTrigger
+-- DynamoDB                           +-- DynamoDB (replica)
+-- SNS Topic (alarm notifications)    +-- SNS Topic (alarm notifications)
+-- CloudWatch alarms                  +-- CloudWatch alarms

                    ap-southeast-2 (Sydney)
                    +-- AWS DevOps Agent
                        +-- Agent Space
                        +-- Slack (optional)
                        +-- GitHub (optional)
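
The diagram shows a `devopsAgentTrigger` Lambda next to the SNS topic in each region, but its code is not part of this post. As a rough sketch of the first half of such a function, the following parses the CloudWatch alarm that SNS delivers and keeps only alarms in the `ALARM` state. The function names and the returned shape are assumptions, and the step that actually notifies the agent is left out because the source does not show it.

```python
import json


def extract_alarms(sns_event):
    """Pull CloudWatch alarm details out of an SNS-triggered Lambda event."""
    alarms = []
    for record in sns_event.get("Records", []):
        # CloudWatch publishes the alarm as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        alarms.append({
            "name": message.get("AlarmName"),
            "state": message.get("NewStateValue"),
            "reason": message.get("NewStateReason"),
            "arn": message.get("AlarmArn"),
        })
    return alarms


def handler(event, context):
    # Only alarms entering ALARM need an investigation; OK transitions are skipped.
    firing = [a for a in extract_alarms(event) if a["state"] == "ALARM"]
    # Forwarding to the agent is intentionally omitted: that call is not shown in the source.
    return {"firing": firing}
```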

Service Layer Summary

Layer	Service	Notes
Frontend	Next.js (static) + S3 + CloudFront	`payledger.yourdomain.com`
DNS	Route 53	Fail‑over routing + health checks
Compute	Lambda (Python 3.12)	5 functions per region
API	API Gateway (HTTP API, regional)	Custom domain per region
Database	DynamoDB Global Tables	Multi‑region replication
Observability	CloudWatch	Alarms in both regions
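
The Observability row only says that alarms exist in both regions. As an illustration of what one of them might look like, this sketch builds the keyword arguments for CloudWatch's `put_metric_alarm` call, watching the Lambda `Errors` metric and notifying an SNS topic. The alarm name, period, and threshold are assumptions, not the demo's actual settings.

```python
def lambda_error_alarm(function_name, topic_arn):
    """Keyword arguments for cloudwatch.put_metric_alarm (values are assumptions)."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                 # evaluate one-minute buckets
        "EvaluationPeriods": 1,       # a single bad minute trips the alarm
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not an incident
        "AlarmActions": [topic_arn],  # the SNS topic from the diagram
    }
```

Note that throttled invocations are reported under the separate `Throttles` metric, so an alarm on `Errors` alone would not see a function whose reserved concurrency is 0.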

Fail‑over Behaviour

Health‑check – Route 53 calls /health every 10 s.
Fail‑over trigger – If the check fails twice (≈ 20 s), DNS switches traffic to the secondary region.
Frontend indicator – The UI polls /health every 5 s and shows a coloured badge:
- Green – Singapore (PRIMARY)
- Amber – Tokyo (FAILOVER)
Data replication – DynamoDB Global Tables keep balance and transaction history in sync across regions. After a fail‑over, the secondary serves the same data once replication has caught up; only the serving region changes.
Fault injection scenario – When faults are injected into ap‑southeast‑1 (Singapore), the health check starts failing and Route 53 begins answering with ap‑northeast‑1 (Tokyo) after about 20 s. Clients that cached the previous DNS answer can take up to the 60 s record TTL longer to switch. Users are then served from Tokyo while the DevOps Agent investigates. Once the root cause is fixed and the primary health check passes again, Route 53 fails back.
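
The timing above is simple arithmetic, sketched here as a small helper. It uses the 10 s interval and 2‑failure threshold from this section plus the 60 s TTL from the high‑level diagram. It is an approximation: it ignores how long Route 53's health checkers take to agree and how individual resolvers treat TTLs.

```python
def failover_seconds(interval_s=10, failure_threshold=2, ttl_s=60):
    """Return (detection time, worst-case client cutover) in seconds."""
    # Consecutive failed checks needed before the endpoint is marked unhealthy.
    detect = interval_s * failure_threshold
    # A client that cached the old answer just before detection waits out the TTL.
    worst_case = detect + ttl_s
    return detect, worst_case


detect, worst_case = failover_seconds()
print(f"detected after ~{detect}s, all clients switched within ~{worst_case}s")
```

Lowering the TTL shortens the worst case at the cost of more DNS queries; the detection time is bounded below by the health check interval.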

Fault‑Injection Matrix

#	Fault	How it breaks	Root‑cause category
1	IAM permission denied	Role swapped to a fault role with no DynamoDB access	System change
2	Lambda throttling	Reserved concurrency set to 0 → 429 responses before function runs	Resource limits
3	Missing environment variable	`TABLE_NAME` removed → `KeyError` at module load	Code / config change
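
Fault 3 is worth a closer look, because where the lookup happens decides how the failure shows up. The sketch below assumes the functions read the table name with `os.environ["TABLE_NAME"]`, which is consistent with the `KeyError` in the table but is not shown in the source. The `env` parameter is only there to make the behaviour easy to demonstrate.

```python
import os


def load_table_name(env=os.environ):
    # Indexing raises KeyError when the variable is absent. A module-level
    # `TABLE_NAME = os.environ["TABLE_NAME"]` would hit this at import time,
    # before any handler code runs.
    return env["TABLE_NAME"]


def load_table_name_or_none(env=os.environ):
    # .get() defers the failure: the module loads, and the missing value
    # surfaces later, wherever the table name is first used.
    return env.get("TABLE_NAME")
```

Failing at module load means every invocation fails during initialization, which makes the fault loud and uniform across requests.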

All three faults are triggered simultaneously with `python scripts/fault.py inject`. The default mode assigns one distinct fault per service.

One alarm fires in ap‑southeast‑1.
Three different root causes appear in the investigation.
The DevOps Agent must untangle all three in a single run – a tougher test than handling each fault individually.
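
For readers who want a feel for what injecting these faults involves, here is a sketch built on three boto3 Lambda client calls: `update_function_configuration`, `put_function_concurrency`, and `get_function_configuration`. It is not the repository's `fault.py`. The function name `inject_faults`, the `functions` mapping, and the choice of which function receives which fault are all assumptions.

```python
def inject_faults(lambda_client, functions, fault_role_arn):
    """Apply one distinct fault per function using a boto3-style Lambda client.

    `functions` maps a fault key ("iam", "throttle", "env") to a function name.
    """
    # Fault 1: swap the execution role for one without DynamoDB access.
    lambda_client.update_function_configuration(
        FunctionName=functions["iam"], Role=fault_role_arn)

    # Fault 2: reserved concurrency of 0 throttles every invocation.
    lambda_client.put_function_concurrency(
        FunctionName=functions["throttle"], ReservedConcurrentExecutions=0)

    # Fault 3: rewrite the environment without TABLE_NAME, keeping other variables.
    current = lambda_client.get_function_configuration(
        FunctionName=functions["env"])
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables.pop("TABLE_NAME", None)
    lambda_client.update_function_configuration(
        FunctionName=functions["env"], Environment={"Variables": variables})
```

Passing the client in as a parameter keeps the sketch testable without AWS credentials. A matching restore step would need to record the original role, concurrency setting, and environment before changing them.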

Disaster‑Recovery (DR) Toolkit Context

The Prepare phase is covered by the DR Toolkit.
This series focuses on Investigate and Recover – the steps that happen after an alarm fires.

The AWS DevOps Agent does not need the DR Toolkit to investigate. It:

- Reads the topology.
- Correlates signals across services.
- Identifies root causes.
- Posts findings to Slack automatically.

Optionally, you can preload a runbook generated by the DR Toolkit as a Custom Skill to give the agent extra architectural knowledge.

Part 2 – Hands‑On Demo

In the next part we will:

- Deploy PayLedger to both regions.
- Configure Route 53 fail‑over.
- Set up the Agent Space (Slack, GitHub, etc.).
- Run the three simultaneous faults.

We’ll walk through the agent’s timeline, findings, root‑cause identification, and mitigation recommendations.

Get Started / Fork It

Repository:
Project name: payledger-aws-devops-agent

PayLedger – a multi‑region serverless payment ledger (demo only).
Deployed across ap‑southeast‑1 (Singapore, primary) and ap‑northeast‑1 (Tokyo, secondary) using Lambda, DynamoDB Global Tables, and Route 53 fail‑over routing.
Note: This is a demonstration project. It does not process real transactions and contains no PII.

Architecture Diagram (high‑level)

payledger.yourdomain.com (CloudFront + S3)
          │
   Next.js static UI (balance, transactions, region indicator)
          │
          ▼
api-payledger.yourdomain.com
          │
Route 53 fail‑over routing
├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
└── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
TTL: 60 s | health check: 10 s interval, 2 failures to trip
          │
 ┌────────┴─────────┐
 │                  │
ap‑southeast‑1      ap‑northeast‑1
(Singapore)         (Tokyo)
│                   │
│  API Gateway      │  API Gateway
│  (regional)       │  (regional)
│  Lambda: createTransaction
│  Lambda: listTransactions
│  … (other functions)
│
└─ DynamoDB Global Table (replicated)

References

AWS DevOps Agent – features, supported regions, and overview.
Amazon DynamoDB Global Tables – multi‑region replication.
Amazon Route 53 – fail‑over routing configuration.
Disaster Recovery of Workloads on AWS – best‑practice guide.
