AWS DevOps Agent Explained: Architecture, Setup, and Real Root-Cause Demo (CloudWatch + EKS)

Published: December 6, 2025 at 05:19 PM EST

4 min read

Source: Dev.to

What is AWS DevOps Agent

The AWS DevOps Agent acts like a 24/7 “on‑call” engineer. It has access to a wide range of tools and data sources, enabling it to:

Discover and map the topology of your AWS resources and applications.
Investigate incidents (e.g., CloudWatch alarms, EKS pod failures).
Provide step‑by‑step root‑cause explanations, mitigation recommendations, and prevention guidance.

The agent does not automatically fix the issue; a human operator must apply the recommended actions.

Architecture

Dual‑Console Model

Role	Console	Purpose
Admin	Management Console	Create and manage Agent Spaces, configure capabilities, and set access controls.
Operator	AWS DevOps Agent web app	Interact with the agent, start investigations, and view results.

Agent Spaces

Logical containers that define which AWS accounts, external tools, and users the agent can access.
Each space uses dedicated IAM roles that grant the minimum required permissions.
Information is isolated per space—data from one space is not visible to another.

Security

IAM Identity Center (or an external IdP) controls user access to the web app, with MFA support.
Admins can launch the web app directly from the Management Console using their existing session.

How to Maximize the Agent’s Effectiveness

Integrate additional capabilities to give the agent richer context:

Multiple AWS accounts – connect cross‑account resources.
CI/CD pipelines – GitHub, GitLab, etc.
MCP servers – for extended monitoring.
Telemetry sources – Datadog, New Relic.
Ticketing & chat – ServiceNow, Slack.
EKS clusters – for container‑level investigations.

You can also preload runbooks to provide the agent with custom investigation hints.

Resource Discovery

The agent builds its topology in two ways:

CloudFormation stacks – automatically lists all stacks and their resources (including CDK‑generated resources).
Resource tags – discovers resources created outside CloudFormation (e.g., via the console or Terraform) by scanning tag key/value pairs.
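Tag-based discovery can be pictured as a simple filter over a resource inventory. The sketch below is only an illustration of that idea, not the agent's actual implementation: the `discover_by_tags` helper, the inventory shape, and the ARNs are hypothetical, and the `DevOps=Demo` tag is the example pair used later in this article.

```python
from typing import Dict, List

# Hypothetical inventory; account ID, resource IDs, and names are made up.
SAMPLE_INVENTORY: List[dict] = [
    {"arn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0example1",
     "tags": {"DevOps": "Demo"}},
    {"arn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0example2",
     "tags": {"DevOps": "Prod"}},
    {"arn": "arn:aws:eks:us-east-1:111122223333:cluster/demo",
     "tags": {"DevOps": "Demo", "Team": "platform"}},
    {"arn": "arn:aws:s3:::untagged-bucket", "tags": {}},
]


def discover_by_tags(resources: List[dict], required_tags: Dict[str, str]) -> List[str]:
    """Return the ARNs of resources that carry every required tag key/value pair."""
    matched = []
    for resource in resources:
        tags = resource.get("tags", {})
        # A resource matches only if all required pairs are present with equal values.
        if all(tags.get(key) == value for key, value in required_tags.items()):
            matched.append(resource["arn"])
    return matched


if __name__ == "__main__":
    for arn in discover_by_tags(SAMPLE_INVENTORY, {"DevOps": "Demo"}):
        print(arn)
```

Extra tags on a resource do not prevent a match, which is why the cluster tagged with both `DevOps` and `Team` is still returned.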

Demo 1: Investigate a CloudWatch Alarm

Prerequisites

Access to the us-east-1 region (the agent is only available there).
A single standalone AWS account (no Organizations required).
Basic CloudFormation knowledge.

Steps

Deploy the CloudFormation stack (creates a security group, SSH key‑pair, an EC2 instance that runs a CPU stress test, a CloudWatch alarm for CPU utilization, and an auto‑shutdown rule).

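# Example snippet – replace with the full template you used
Resources:
  StressTestInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro
      ImageId: ami-0abcdef1234567890   # placeholder AMI ID; use a valid one for us-east-1
      KeyName: devops-key
      SecurityGroupIds:
        - !Ref StressTestSG            # security group defined elsewhere in the full template
      # EC2 user data must be Base64-encoded in CloudFormation
      UserData:
        Fn::Base64: |
          #!/bin/bash
          yum install -y stress
          stress --cpu 2 --timeout 600 &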

Wait 5–10 minutes for the alarm to fire (the stress test spikes CPU).
Open the AWS DevOps Agent web app (via the Management Console or the direct Operator Access link).
In Incident Response, locate the latest alarm and click Start Investigation.

The agent automatically fills the investigation prompt, runs the analysis, and returns the root cause (CPU overload from the stress script).
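The 5–10 minute wait is consistent with how a threshold alarm evaluates periodic datapoints. The sketch below is a deliberately simplified model (consecutive breaching datapoints only, no missing-data handling), and the threshold, period, and evaluation-period values are assumptions, since the demo template's alarm settings are not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlarmConfig:
    threshold: float         # CPU percent that counts as breaching (assumed value)
    period_seconds: int      # length of one metric datapoint (assumed value)
    evaluation_periods: int  # consecutive breaching datapoints required (assumed value)


def seconds_until_alarm(datapoints: List[float], cfg: AlarmConfig) -> Optional[int]:
    """Return the seconds elapsed when the alarm would fire, or None if it never does.

    Time is measured from the start of the first datapoint's period.
    """
    streak = 0
    for index, value in enumerate(datapoints):
        # Count consecutive breaching datapoints; any non-breaching one resets the streak.
        streak = streak + 1 if value > cfg.threshold else 0
        if streak >= cfg.evaluation_periods:
            return (index + 1) * cfg.period_seconds
    return None


if __name__ == "__main__":
    cpu = [99.0, 98.5]  # made-up CPU readings while the stress test runs
    print(seconds_until_alarm(cpu, AlarmConfig(80.0, 300, 1)))
    print(seconds_until_alarm(cpu, AlarmConfig(80.0, 300, 2)))
```

Under these assumed settings, one breaching 5-minute datapoint fires the alarm after 300 seconds and two consecutive ones after 600 seconds, which lines up with the observed wait.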

Observations

When two alarms occur within ~40 minutes, the agent sometimes fails to pinpoint the root cause and must be rerun.
For user‑initiated incidents, the agent may omit a mitigation plan because no automated remediation is applicable.
The agent highlights investigation gaps (e.g., missing SSH access or CloudWatch log groups) to explain why certain details are unavailable.

You can ask follow‑up questions in natural language via the chat interface.

Demo 2: Investigate an EKS Pod Error

Prerequisites

Terraform code that provisions an EKS cluster and an intentionally failing NGINX pod (ImagePullBackOff).
The IAM role ARN of the Agent Space (found under View role permissions in the Management Console).

Steps

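Grant the Agent Space role access to the EKS cluster by creating an access entry and associating it with the AmazonEKSAdminViewPolicy access policy.

# terraform/eks_role.tf
# Pass the Agent Space role ARN via a variable
variable "agent_role_arn" {}

# AmazonEKSAdminViewPolicy is an EKS access policy, granted through an access entry.
# aws_eks_cluster.demo is an assumed resource name; adjust it to your configuration.
resource "aws_eks_access_entry" "agent_access" {
  cluster_name  = aws_eks_cluster.demo.name
  principal_arn = var.agent_role_arn
}
resource "aws_eks_access_policy_association" "agent_view" {
  cluster_name  = aws_eks_access_entry.agent_access.cluster_name
  principal_arn = aws_eks_access_entry.agent_access.principal_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy"
  access_scope { type = "cluster" }
}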

Update terraform.tfvars with the copied ARN and apply the Terraform configuration.
Tag the newly created resources with the same tag key/value pairs used by the Agent Space (e.g., DevOps=Demo). This enables the agent to discover them.
In the Agent web app, navigate to Capabilities → Edit and verify that the tags are recognized.
Ask the agent: “What is causing the NGINX pod ImagePullBackOff error?”

The agent discovers the EKS resources, identifies the missing image pull secret, and returns:
- Root cause – missing image pull secret.
- Mitigation steps – create the secret and attach it to the service account.
- Rollback guidance – how to revert the change if the mitigation introduces new issues.
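The same diagnosis can be approximated by hand from the pod's JSON (for example, the output of `kubectl get pod <name> -o json`). The sketch below is an assumption-laden illustration, not how the agent works internally: the `diagnose_image_pull` helper and the sample pod are hypothetical, while the field names follow the standard Kubernetes Pod schema.

```python
from typing import Any, Dict, List

# Waiting reasons Kubernetes reports when an image cannot be pulled.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}

# Hypothetical, trimmed-down pod object for illustration.
SAMPLE_POD: Dict[str, Any] = {
    "metadata": {"name": "nginx-demo"},
    "spec": {"containers": [{"name": "nginx", "image": "registry.example.com/nginx:1.27"}]},
    "status": {
        "containerStatuses": [
            {
                "name": "nginx",
                "image": "registry.example.com/nginx:1.27",
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            }
        ]
    },
}


def diagnose_image_pull(pod: Dict[str, Any]) -> List[str]:
    """List human-readable findings about image pull failures in a pod object."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in PULL_FAILURE_REASONS:
            findings.append(
                f"container {status['name']} cannot pull {status['image']}: {waiting['reason']}"
            )
    # Only flag the missing secret when a pull failure was actually observed.
    if findings and not pod.get("spec", {}).get("imagePullSecrets"):
        findings.append("pod spec has no imagePullSecrets")
    return findings


if __name__ == "__main__":
    for line in diagnose_image_pull(SAMPLE_POD):
        print(line)
```

A missing secret is only one possible cause of ImagePullBackOff (a wrong image name or tag produces the same status), so the second finding is a hint to check rather than a conclusion.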

Takeaways

The agent can shorten Mean Time To Recovery (MTTR) by providing a specific diagnosis and actionable remediation steps.
It can also recommend preventive measures (e.g., enforcing image pull secret policies) to avoid similar incidents.
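To check whether MTTR actually improves in your own environment, it helps to compute it the same way before and after adopting the agent. A minimal sketch, assuming MTTR is the average of (recovery time minus detection time) over a set of incidents; the timestamps below are made up for illustration and are not measurements from the demos.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (detected_at, recovered_at)

# Made-up incidents for illustration only.
SAMPLE_INCIDENTS: List[Incident] = [
    (datetime(2025, 12, 1, 9, 0), datetime(2025, 12, 1, 9, 30)),
    (datetime(2025, 12, 2, 14, 0), datetime(2025, 12, 2, 15, 0)),
]


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average the recovery duration across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),  # start value so the sum works on timedelta objects
    )
    return total / len(incidents)


if __name__ == "__main__":
    print(mean_time_to_recovery(SAMPLE_INCIDENTS))
```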

DevOps Engineer Perspective

Security – When properly configured, the agent is secure. Admins control exactly which AWS resources and external services the agent can access via scoped IAM roles.
Job impact – The agent is an assistant, not a replacement. Engineers are still required to apply fixes, design infrastructure, and build new features.
Future potential – Connecting additional data sources (e.g., MCP servers) can further enrich the agent’s context, making investigations even faster and more comprehensive.

The AWS DevOps Agent showcases how AI‑assisted tooling can accelerate incident response while keeping human expertise at the core of remediation.
