AWS DevOps Agent Explained: Architecture, Setup, and Real Root-Cause Demo (CloudWatch + EKS)
Source: Dev.to
What is AWS DevOps Agent
The AWS DevOps Agent acts like a 24/7 “on‑call” engineer. It has access to a wide range of tools and data sources, enabling it to:
- Discover and map the topology of your AWS resources and applications.
- Investigate incidents (e.g., CloudWatch alarms, EKS pod failures).
- Provide step‑by‑step root‑cause explanations, mitigation recommendations, and prevention guidance.
The agent does not automatically fix the issue; a human operator must apply the recommended actions.
Architecture
Dual‑Console Model
| Role | Console | Purpose |
|---|---|---|
| Admin | Management Console | Create and manage Agent Spaces, configure capabilities, and set access controls. |
| Operator | AWS DevOps Agent web app | Interact with the agent, start investigations, and view results. |
Agent Spaces
- Logical containers that define which AWS accounts, external tools, and users the agent can access.
- Each space uses dedicated IAM roles that grant the minimum required permissions.
- Information is isolated per space—data from one space is not visible to another.
Security
- IAM Identity Center (or an external IdP) controls user access to the web app, with MFA support.
- Admins can launch the web app directly from the Management Console using their existing session.
How to Maximize the Agent’s Effectiveness
Integrate additional capabilities to give the agent richer context:
- Multiple AWS accounts – connect cross‑account resources.
- CI/CD pipelines – GitHub, GitLab, etc.
- MCP servers – for extended monitoring.
- Telemetry sources – Datadog, New Relic.
- Ticketing & chat – ServiceNow, Slack.
- EKS clusters – for container‑level investigations.
You can also preload runbooks to provide the agent with custom investigation hints.
Resource Discovery
The agent builds its topology in two ways:
- CloudFormation stacks – automatically lists all stacks and their resources (including CDK‑generated resources).
- Resource tags – discovers resources created outside CloudFormation (e.g., via the console or Terraform) by scanning tag key/value pairs.
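To make the tag-based path concrete, here is a minimal sketch of what matching on tag key/value pairs amounts to. It is not the agent's actual implementation: the `inventory` list, its ARNs, and the helper names `matches_tags` and `discover_by_tags` are all hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical in-memory inventory record: an ARN plus its tags.
Resource = Dict[str, Any]


def matches_tags(resource_tags: Dict[str, str], required: Dict[str, str]) -> bool:
    """Return True when every required key/value pair is present on the resource."""
    return all(resource_tags.get(key) == value for key, value in required.items())


def discover_by_tags(inventory: Iterable[Resource], required: Dict[str, str]) -> List[str]:
    """Return the ARNs of resources whose tags include all required pairs."""
    return [
        res["arn"]
        for res in inventory
        if matches_tags(res.get("tags", {}), required)
    ]


# Made-up inventory; account ID, ARNs, and tag values are placeholders.
inventory = [
    {"arn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0aaa",
     "tags": {"DevOps": "Demo"}},
    {"arn": "arn:aws:eks:us-east-1:111122223333:cluster/demo",
     "tags": {"DevOps": "Demo", "Team": "platform"}},
    {"arn": "arn:aws:s3:::untagged-bucket", "tags": {}},
    {"arn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0bbb",
     "tags": {"DevOps": "Prod"}},
]

discovered = discover_by_tags(inventory, {"DevOps": "Demo"})
print(discovered)
```

Tag keys and values are case-sensitive, so `devops=Demo` would not match `DevOps=Demo`. Against a real account, a comparable lookup can be done with the Resource Groups Tagging API (`get_resources` with `TagFilters`).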
Demo 1: Investigate a CloudWatch Alarm
Prerequisites
- Access to the us‑east‑1 region (the only region where the agent was available at the time of writing).
- A single standalone AWS account (no Organizations required).
- Basic CloudFormation knowledge.
Steps
- Deploy the CloudFormation stack. It creates a security group, an SSH key pair, an EC2 instance that runs a CPU stress test, a CloudWatch alarm on CPU utilization, and an auto‑shutdown rule.

      # Example snippet – replace with the full template you used
      Resources:
        StressTestInstance:
          Type: AWS::EC2::Instance
          Properties:
            InstanceType: t3.micro
            ImageId: ami-0abcdef1234567890  # placeholder AMI ID
            KeyName: devops-key
            SecurityGroupIds:
              - !Ref StressTestSG
            # CloudFormation expects user data to be Base64-encoded
            UserData:
              Fn::Base64: |
                #!/bin/bash
                yum install -y stress
                stress --cpu 2 --timeout 600 &

- Wait 5–10 minutes for the alarm to fire (the stress test spikes CPU).
- Open the AWS DevOps Agent web app (via the Management Console or the direct Operator Access link).
- In Incident Response, locate the latest alarm and click Start Investigation.
The agent automatically fills in the investigation prompt, runs the analysis, and returns the root cause (CPU overload from the stress script).
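The 5–10 minute wait in the steps above is consistent with how a threshold alarm evaluates whole periods before changing state. The sketch below is a simplified model, not CloudWatch's full evaluation logic (it ignores missing data and "M out of N" settings). The 80% threshold, 5-minute period, and 2 evaluation periods are assumptions, since the demo's alarm settings are not given, and `first_alarm_minute` is a hypothetical helper.

```python
from typing import List, Optional


def first_alarm_minute(
    cpu_by_period: List[float],
    threshold: float = 80.0,      # assumed threshold, percent CPU
    period_minutes: int = 5,      # assumed alarm period
    evaluation_periods: int = 2,  # assumed consecutive breaching periods
) -> Optional[int]:
    """Return minutes elapsed when the alarm would first fire, else None.

    Simplified model: the alarm fires once the most recent
    `evaluation_periods` datapoints all exceed `threshold`.
    """
    breaching_run = 0
    for index, value in enumerate(cpu_by_period):
        # Extend the run of breaching periods, or reset it on a healthy one.
        breaching_run = breaching_run + 1 if value > threshold else 0
        if breaching_run >= evaluation_periods:
            return (index + 1) * period_minutes
    return None


# Hypothetical per-period average CPU once the stress test starts.
stress_profile = [97.0, 99.0, 98.0]

print(first_alarm_minute(stress_profile))                        # 10
print(first_alarm_minute(stress_profile, evaluation_periods=1))  # 5
print(first_alarm_minute([12.0, 15.0, 9.0]))                     # None
```

Under these assumed settings, one evaluation period gives the 5-minute lower bound and two give the 10-minute upper bound.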
Observations
- When two alarms occur within ~40 minutes, the agent sometimes fails to pinpoint the root cause and must be rerun.
- For user‑initiated incidents, the agent may omit a mitigation plan because no automated remediation is applicable.
- The agent highlights investigation gaps (e.g., missing SSH access or CloudWatch log groups) to explain why certain details are unavailable.
You can ask follow‑up questions in natural language via the chat interface.
Demo 2: Investigate an EKS Pod Error
Prerequisites
- Terraform code that provisions an EKS cluster and an intentionally failing NGINX pod (ImagePullBackOff).
- The IAM role ARN of the Agent Space (found under View role permissions in the Management Console).
Steps
- Grant the Agent Space role access to the EKS cluster with the AmazonEKSAdminViewPolicy access policy.

      # terraform/eks_role.tf
      # Pass the Agent Space role ARN via a variable
      variable "agent_role_arn" {}

      # AmazonEKSAdminViewPolicy is an EKS access policy, granted through an access entry.
      resource "aws_eks_access_entry" "agent" {
        cluster_name  = aws_eks_cluster.main.name  # placeholder: your cluster resource
        principal_arn = var.agent_role_arn
      }

      resource "aws_eks_access_policy_association" "agent_access" {
        cluster_name  = aws_eks_access_entry.agent.cluster_name
        principal_arn = aws_eks_access_entry.agent.principal_arn
        policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy"
        access_scope { type = "cluster" }
      }

- Update terraform.tfvars with the copied ARN and apply the Terraform configuration.
- Tag the newly created resources with the same tag key/value pairs used by the Agent Space (e.g., DevOps=Demo). This enables the agent to discover them.
- In the Agent web app, navigate to Capabilities → Edit and verify that the tags are recognized.
- Ask the agent: “What is causing the NGINX pod ImagePullBackOff error?”
The agent discovers the EKS resources, identifies the missing image pull secret, and returns:
- Root cause – missing image pull secret.
- Mitigation steps – create the secret and attach it to the service account.
- Rollback guidance – how to revert the change if the mitigation introduces new issues.
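The mitigation the agent describes (create the secret, then attach it to the service account) can be sketched as two small manifests. This is an illustrative sketch, not output from the agent. The secret name `regcred`, the registry host, and the credentials are hypothetical placeholders, and both helper functions are made up for this example.

```python
import base64
import json
from typing import Any, Dict


def image_pull_secret(name: str, namespace: str, registry: str,
                      username: str, password: str) -> Dict[str, Any]:
    """Build a Secret manifest of type kubernetes.io/dockerconfigjson."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    # Secret data values must be Base64-encoded.
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


def service_account_patch(secret_name: str) -> Dict[str, Any]:
    """Build a patch that makes a service account use the pull secret."""
    return {"imagePullSecrets": [{"name": secret_name}]}


# All values below are hypothetical placeholders.
secret = image_pull_secret(
    "regcred", "default", "registry.example.com", "demo-user", "demo-token"
)
patch = service_account_patch("regcred")

print(json.dumps(secret, indent=2))
print(json.dumps(patch))
```

The secret manifest can be applied with `kubectl apply -f`, and the patch with `kubectl patch serviceaccount <name> -p '<patch JSON>'`. Rolling back means removing the `imagePullSecrets` entry and deleting the secret.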
Takeaways
- The agent can shorten Mean Time To Recovery (MTTR) by delivering a root‑cause diagnosis and actionable remediation steps, as both demos show.
- It can suggest prevention recommendations (e.g., enforce image pull secret policies) to avoid similar incidents.
DevOps Engineer Perspective
- Security – When properly configured, the agent is secure. Admins control exactly which AWS resources and external services the agent can access via scoped IAM roles.
- Job impact – The agent is an assistant, not a replacement. Engineers are still required to apply fixes, design infrastructure, and build new features.
- Future potential – Connecting additional data sources (e.g., MCP servers) can further enrich the agent’s context, making investigations even faster and more comprehensive.
The AWS DevOps Agent showcases how AI‑assisted tooling can accelerate incident response while keeping human expertise at the core of remediation.