Runbooks Don't Investigate. AWS DevOps Agent Does.

Published: May 3, 2026 at 09:14 AM EDT
9 min read
Source: Dev.to

Overview

I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post‑mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not covered. That gap is what this series is about.

In the BuildWithAI: DR Toolkit on AWS series, I ran through how you can build six AI‑powered tools that automate the tedious parts of DR planning, all running on serverless AWS in ap‑southeast‑1. Those tools handle what you do before an incident and what you do after. But the part in between—the actual incident response—none of them touch.

This series covers that middle phase using AWS DevOps Agent. The demo app is PayLedger, a multi‑region serverless payment ledger built specifically for this blog. It is not a real product and contains no real user data.

  • Part 1 maps out the gap, introduces DevOps Agent, and walks through the architecture.
  • Part 2 covers the full setup and the actual demo, including what the agent’s investigation looked like when I ran three real faults against it.

The DR Lifecycle, Mapped Out

Phase | What happens | Covered by
Prepare | Runbooks, RTO/RPO targets, DR strategy, checklists | DR Toolkit
Detect | Alarm fires, SNS notifies DevOps Agent, health‑check fails, DNS fails over | CloudWatch + Route 53 + SNS
Investigate | Root cause analysis, cross‑region signal correlation | AWS DevOps Agent
Recover | Apply fix, bring the unhealthy region back up, validate failback | Human + runbook
Learn | Prevention recommendations, operational improvements | DevOps Agent

The DR Toolkit is solid for Prepare. CloudWatch and Route 53 handle Detect—alarms fire and Route 53 failover routes traffic to the healthy region automatically. But Investigate is the phase with no real tooling unless someone builds it themselves. Figuring out why a service running in the primary region is down, correlating signals across services, and giving the team the information needed to bring that region back up—that is what AWS DevOps Agent targets.


AWS DevOps Agent

AWS DevOps Agent is a frontier agent for cloud operations. “Frontier agent” is AWS’s term for autonomous systems that:

  • Work independently,
  • Scale across concurrent tasks, and
  • Run persistently without constant human oversight.

The agent starts working the moment an alarm fires—no manual trigger needed.

Three Core Capabilities

  1. Autonomous incident response

    • When an alert arrives, the agent begins investigating immediately.
    • It correlates signals across services and regions.
    • If multiple alarms stem from the same root cause, it groups them together.
    • Root‑cause categories it investigates:
      • System changes
      • Input anomalies
      • Resource limits
      • Component failures
      • Dependency issues
  2. Proactive incident prevention

    • After an investigation, the agent recommends improvements in four areas:
      • Observability
      • Infrastructure optimization
      • Deployment pipeline
      • Application resilience
  3. On‑demand SRE tasks

    • Conversational chat against your actual infrastructure.
    • Ask about resource state, alarm status, deployment history, etc., without switching consoles.

Architecture

The service uses a dual‑console architecture:

Console | Purpose
AWS Console | Admin setup (Agent Space creation, integrations).
Agent Space web app | Day‑to‑day work (investigations, topology, prevention, chat).

More on features:

  • [AWS DevOps Agent features]
  • [About AWS DevOps Agent]

Availability

As of this writing, AWS DevOps Agent is not available in ap‑southeast‑1 (Singapore) at GA. Supported regions are:

us-east-1
us-west-2
eu-central-1
eu-west-1
ap-southeast-2
ap-northeast-1

AWS may add more regions in the future, so check the Supported Regions page before you start.

  • The two closest for SEA builders are ap‑southeast‑2 (Sydney) and ap‑northeast‑1 (Tokyo).
  • For this demo I used ap‑southeast‑2, but any supported region works.
  • The Agent Space and its investigation data live in the chosen region; your workload stays wherever it is.
  • Cross‑region monitoring means the agent discovers and monitors resources across any linked AWS account regardless of region.

Note: PayLedger is a demo project built solely for this blog series. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information. All data is synthetic and generated by a seed script.


PayLedger Demo Application

A payment ledger is a practical choice for a DR demo because the requirements are clear: any outage means transactions fail and balances go stale. The multi‑region setup is the right response, not over‑engineering.

Endpoints

Endpoint | Description
POST /transaction | Record a transaction
GET /transactions | List recent transactions
GET /balance | Get the current balance
GET /health | Health check
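
A minimal sketch of what the GET /health handler could look like, assuming an API Gateway HTTP API with a Lambda proxy integration. It is not taken from the PayLedger repository: the handler name and the response fields (status, region) are illustrative assumptions. AWS_REGION is the environment variable the Lambda runtime sets, which is one way the UI's region indicator could learn which region answered.

```python
import json
import os


def handler(event, context):
    """Report that the function is up and which region served the request."""
    # AWS_REGION is set by the Lambda runtime; "unknown" is a fallback for local runs.
    region = os.environ.get("AWS_REGION", "unknown")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # Hypothetical field names, chosen for illustration only.
        "body": json.dumps({"status": "ok", "region": region}),
    }
```

Returning the region in the body keeps the check cheap for Route 53 (it only needs the 2xx status) while giving the frontend enough to decide which badge to show.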

Architecture Diagram

payledger.yourdomain.com  (CloudFront + S3)
        |
   Next.js UI
   (balance, transactions, region indicator)
        | calls
        v
api-payledger.yourdomain.com
        |
   Route 53 (failover routing)
        |-- PRIMARY   --> ap-southeast-1 (Singapore)
        +-- SECONDARY --> ap-northeast-1 (Tokyo)

ap-southeast-1                         ap-northeast-1
+-- API Gateway                        +-- API Gateway
+-- Lambda: createTransaction          +-- Lambda: createTransaction
+-- Lambda: listTransactions           +-- Lambda: listTransactions
+-- Lambda: getBalance                 +-- Lambda: getBalance
+-- Lambda: health                     +-- Lambda: health
+-- Lambda: devopsAgentTrigger         +-- Lambda: devopsAgentTrigger
+-- DynamoDB Global Table              +-- DynamoDB Global Table (replica)
+-- SNS Topic (alarm notifications)    +-- SNS Topic (alarm notifications)
+-- CloudWatch alarms                  +-- CloudWatch alarms

                    ap-southeast-2 (Sydney)
                    +-- AWS DevOps Agent
                        +-- Agent Space
                        +-- Slack (optional)
                        +-- GitHub (optional)

  • Route 53 performs active‑passive failover between the two regions.
  • DynamoDB Global Tables replicate ledger data across regions, keeping both regions' data in sync.
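
The diagram lists a devopsAgentTrigger function next to the SNS topic, but the post does not show its code. As a rough sketch only, this is how a Lambda subscribed to that topic could unpack a CloudWatch alarm notification. The function names and the returned dictionary are assumptions, and the actual hand-off to DevOps Agent is deliberately left out because the source does not describe it.

```python
import json


def extract_alarm(event):
    """Pull the CloudWatch alarm fields out of an SNS-triggered Lambda event."""
    # SNS delivers the alarm as a JSON string in Records[].Sns.Message.
    message = event["Records"][0]["Sns"]["Message"]
    alarm = json.loads(message)
    return {
        "alarm_name": alarm["AlarmName"],
        "state": alarm["NewStateValue"],
        "reason": alarm.get("NewStateReason", ""),
        "region": alarm.get("Region", ""),
    }


def handler(event, context):
    """Decide whether this notification is worth forwarding (hypothetical logic)."""
    alarm = extract_alarm(event)
    # Only transitions into ALARM start an investigation in this sketch;
    # OK and INSUFFICIENT_DATA notifications are ignored.
    should_forward = alarm["state"] == "ALARM"
    # Forwarding to DevOps Agent would happen here; it is omitted on purpose.
    return {"should_forward": should_forward, "alarm": alarm}
```

Filtering on NewStateValue keeps recovery notifications from opening a second investigation for the same incident.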

Service Layer Summary

Layer | Service | Notes
Frontend | Next.js (static) + S3 + CloudFront | payledger.yourdomain.com
DNS | Route 53 | Fail‑over routing + health checks
Compute | Lambda (Python 3.12) | 5 functions per region
API | API Gateway (HTTP API, regional) | Custom domain per region
Database | DynamoDB Global Tables | Multi‑region replication
Observability | CloudWatch | Alarms in both regions

Fail‑over Behaviour

  • Health‑check – Route 53 calls /health every 10 s.
  • Fail‑over trigger – If the check fails twice (≈ 20 s), DNS switches traffic to the secondary region.
  • Frontend indicator – The UI polls /health every 5 s and shows a coloured badge:
    • Green – Singapore (PRIMARY)
    • Amber – Tokyo (FAILOVER)
  • Data replication – DynamoDB Global Tables keep balance and transaction history in sync across regions. By default replication is asynchronous, so after a fail‑over Tokyo serves the same data apart from any writes that had not yet replicated; only the serving region changes.
  • Fault injection scenario – When faults are injected into ap‑southeast‑1 (Singapore), the health check starts failing and Route 53 begins answering with ap‑northeast‑1 (Tokyo) within ~20 s. Clients that cached the old record can take up to the 60 s TTL longer to switch. Users continue to be served from Tokyo while the DevOps Agent investigates. Once the root cause is fixed, the primary region recovers and Route 53 fails back.
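
The timing above is simple arithmetic, sketched here for convenience. The 10 s interval and 2‑failure threshold come from the list; the 60 s TTL is the value shown in the repository's Route 53 diagram. The function names are mine, and the model is a simplification: it ignores health‑checker fan‑out and resolvers that do not honour the TTL.

```python
def detection_seconds(interval_s: int, failure_threshold: int) -> int:
    """Time for Route 53 to mark the endpoint unhealthy (simplified)."""
    return interval_s * failure_threshold


def worst_case_switch_seconds(interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Detection time plus a full TTL for a client that cached the record
    just before the health check tripped."""
    return detection_seconds(interval_s, failure_threshold) + ttl_s


if __name__ == "__main__":
    print(detection_seconds(10, 2))               # 20
    print(worst_case_switch_seconds(10, 2, 60))   # 80
```

So the ~20 s figure is when DNS answers change; an individual client may keep hitting Singapore for up to roughly 80 s in this model.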

Fault‑Injection Matrix

# | Fault | How it breaks | Root‑cause category
1 | IAM permission denied | Role swapped to a fault role with no DynamoDB access | System change
2 | Lambda throttling | Reserved concurrency set to 0 → 429 responses before function runs | Resource limits
3 | Missing environment variable | TABLE_NAME removed → KeyError at module load | Code / config change

All three faults are triggered simultaneously with python scripts/fault.py inject. The default mode assigns one distinct fault per service.

  • One alarm fires in ap‑southeast‑1.
  • Three different root causes appear in the investigation.
  • The DevOps Agent must untangle all three in a single run – a tougher test than handling each fault individually.
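
To make fault 3 concrete, here is a small sketch of why a missing TABLE_NAME fails at module load rather than inside the handler. It assumes the function reads the variable with a plain os.environ["TABLE_NAME"] lookup at import time, which is what a KeyError at module load implies; the helper name and the table name are hypothetical, and a dictionary stands in for the real environment so the snippet runs anywhere.

```python
def load_table_name(env):
    """Mimic a module-level `TABLE_NAME = os.environ["TABLE_NAME"]` lookup."""
    # Indexing (not .get) raises KeyError when the variable is absent.
    return env["TABLE_NAME"]


healthy_env = {"TABLE_NAME": "payledger-ledger"}  # hypothetical table name
faulty_env = {}  # fault 3: the variable has been removed

print(load_table_name(healthy_env))

try:
    load_table_name(faulty_env)
except KeyError as exc:
    # In Lambda this surfaces as an init error, before the handler ever runs.
    print(f"init failure: missing {exc}")
```

Because the lookup happens during initialization, every invocation fails the same way, which looks different in the logs from fault 1 (an access‑denied error inside the handler) and fault 2 (no function logs at all).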

Disaster‑Recovery (DR) Toolkit Context

  • The Prepare phase is covered by the DR Toolkit.
  • This series focuses on Investigate and Recover – the steps that happen after an alarm fires.

The AWS DevOps Agent does not need the DR Toolkit to investigate. It:

  1. Reads the topology.
  2. Correlates signals across services.
  3. Identifies root causes.
  4. Posts findings to Slack automatically.

Optionally, you can preload a runbook generated by the DR Toolkit as a Custom Skill to give the agent extra architectural knowledge.


Part 2 – Hands‑On Demo

In the next part we will:

  1. Deploy PayLedger to both regions.
  2. Configure Route 53 fail‑over.
  3. Set up the Agent Space (Slack, GitHub, etc.).
  4. Run the three simultaneous faults.

We’ll walk through the agent’s timeline, findings, root‑cause identification, and mitigation recommendations.


Get Started / Fork It

  • Repository:
  • Project name: payledger-aws-devops-agent

PayLedger – a multi‑region serverless payment ledger (demo only).
Deployed across ap‑southeast‑1 (Singapore, primary) and ap‑northeast‑1 (Tokyo, secondary) using Lambda, DynamoDB Global Tables, and Route 53 fail‑over routing.
Note: This is a demonstration project. It does not process real transactions and contains no PII.


Architecture Diagram (high‑level)

payledger.yourdomain.com (CloudFront + S3)

   Next.js static UI (balance, transactions, region indicator)


api-payledger.yourdomain.com

Route 53 fail‑over routing
├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
└── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
TTL: 60 s | health check: 10 s interval, 2 failures to trip

 ┌────────┴─────────┐
 │                  │
ap‑southeast‑1      ap‑northeast‑1
(Singapore)         (Tokyo)
│                   │
│  API Gateway      │  API Gateway
│  (regional)       │  (regional)
│  Lambda: createTransaction
│  Lambda: listTransactions
│  … (other functions)

└─ DynamoDB Global Table (replicated)
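
As a hedged sketch, the Route 53 part of the diagram could be expressed as the change batch below. The dictionaries follow the shape that boto3's route53.change_resource_record_sets and create_health_check accept, but nothing here calls AWS. The hostnames, 60 s TTL, 10 s interval and 2‑failure threshold come from the diagram; the CNAME record type, the HTTPS check type, the SetIdentifier values and the health check IDs are assumptions (the real project may use alias records instead).

```python
def health_check_config(fqdn):
    """HealthCheckConfig for one regional endpoint (check type is an assumption)."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "ResourcePath": "/health",
        "RequestInterval": 10,   # seconds between checks
        "FailureThreshold": 2,   # consecutive failures before unhealthy
    }


def failover_record(name, target, role, health_check_id, ttl=60):
    """One UPSERT change for a PRIMARY or SECONDARY failover record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unsupported failover role: {role}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",  # assumption; alias A records are another option
            "SetIdentifier": f"payledger-{role.lower()}",  # hypothetical
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


API_NAME = "api-payledger.yourdomain.com"
change_batch = {
    "Changes": [
        # The health check IDs are placeholders, not real identifiers.
        failover_record(API_NAME, "apse1-api-payledger.yourdomain.com",
                        "PRIMARY", "PRIMARY_HEALTH_CHECK_ID"),
        failover_record(API_NAME, "apne1-api-payledger.yourdomain.com",
                        "SECONDARY", "SECONDARY_HEALTH_CHECK_ID"),
    ]
}
```

Attaching a health check to the secondary record as well, as the diagram suggests, lets Route 53 avoid sending traffic to Tokyo if Tokyo is also unhealthy.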
