AWS DevOps Agent Explained: Architecture, Setup, and Real Root-Cause Demo (CloudWatch + EKS)

Published: (December 6, 2025 at 05:19 PM EST)
4 min read
Source: Dev.to


What is AWS DevOps Agent

The AWS DevOps Agent acts like a 24/7 “on‑call” engineer. It has access to a wide range of tools and data sources, enabling it to:

  • Discover and map the topology of your AWS resources and applications.
  • Investigate incidents (e.g., CloudWatch alarms, EKS pod failures).
  • Provide step‑by‑step root‑cause explanations, mitigation recommendations, and prevention guidance.

The agent does not automatically fix the issue; a human operator must apply the recommended actions.

Architecture

Dual‑Console Model

Role      Console                    Purpose
Admin     Management Console         Create and manage Agent Spaces, configure capabilities, and set access controls.
Operator  AWS DevOps Agent web app   Interact with the agent, start investigations, and view results.

Agent Spaces

  • Logical containers that define which AWS accounts, external tools, and users the agent can access.
  • Each space uses dedicated IAM roles that grant the minimum required permissions.
  • Information is isolated per space—data from one space is not visible to another.

Security

  • IAM Identity Center (or an external IdP) controls user access to the web app, with MFA support.
  • Admins can launch the web app directly from the Management Console using their existing session.

How to Maximize the Agent’s Effectiveness

Integrate additional capabilities to give the agent richer context:

  • Multiple AWS accounts – connect cross‑account resources.
  • CI/CD pipelines – GitHub, GitLab, etc.
  • MCP servers – for extended monitoring.
  • Telemetry sources – Datadog, New Relic.
  • Ticketing & chat – ServiceNow, Slack.
  • EKS clusters – for container‑level investigations.

You can also preload runbooks to provide the agent with custom investigation hints.

Resource Discovery

The agent builds its topology in two ways:

  1. CloudFormation stacks – automatically lists all stacks and their resources (including CDK‑generated resources).
  2. Resource tags – discovers resources created outside CloudFormation (e.g., via the console or Terraform) by scanning tag key/value pairs.
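
To preview which resources the tag-based path could pick up, you can query the Resource Groups Tagging API yourself. The sketch below is not the agent's implementation; it only lists ARNs that carry a given tag. It assumes a boto3 `resourcegroupstaggingapi` client is passed in, and the helper name `find_tagged_arns` and the `DevOps=Demo` tag are illustrative.

```python
from typing import Any


def find_tagged_arns(client: Any, key: str, value: str) -> list[str]:
    """Return the ARNs of all resources tagged key=value.

    `client` is expected to behave like a boto3
    "resourcegroupstaggingapi" client (assumption).
    """
    arns: list[str] = []
    token = ""
    while True:
        kwargs: dict[str, Any] = {
            "TagFilters": [{"Key": key, "Values": [value]}],
        }
        if token:
            # Continue from where the previous page stopped
            kwargs["PaginationToken"] = token
        page = client.get_resources(**kwargs)
        for mapping in page.get("ResourceTagMappingList", []):
            arns.append(mapping["ResourceARN"])
        token = page.get("PaginationToken", "")
        if not token:
            return arns


# Usage against a real account (assumes boto3 and credentials are set up):
#   import boto3
#   tagging = boto3.client("resourcegroupstaggingapi", region_name="us-east-1")
#   print(find_tagged_arns(tagging, "DevOps", "Demo"))
```

Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.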

Demo 1: Investigate a CloudWatch Alarm

Prerequisites

  • Access to the us-east-1 region (at the time of writing, the agent is only available there).
  • A single standalone AWS account (no Organizations required).
  • Basic CloudFormation knowledge.

Steps

  1. Deploy the CloudFormation stack (creates a security group, SSH key‑pair, an EC2 instance that runs a CPU stress test, a CloudWatch alarm for CPU utilization, and an auto‑shutdown rule).

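    # Example snippet – replace with the full template you used
    Resources:
      StressTestInstance:
        Type: AWS::EC2::Instance
        Properties:
          InstanceType: t3.micro
          ImageId: ami-0abcdef1234567890  # replace with a valid AMI ID for us-east-1
          KeyName: devops-key
          SecurityGroupIds:
            - !Ref StressTestSG  # security group defined elsewhere in the template
          UserData:
            # EC2 expects user data to be Base64-encoded
            Fn::Base64: |
              #!/bin/bash
              # Assumes the stress package is available in the AMI's repositories
              yum install -y stress
              # Load 2 vCPUs for 10 minutes so the CPU alarm fires
              stress --cpu 2 --timeout 600 &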
  2. Wait 5–10 minutes for the alarm to fire (the stress test spikes CPU).

  3. Open the AWS DevOps Agent web app (via the Management Console or the direct Operator Access link).

  4. In Incident Response, locate the latest alarm and click Start Investigation.

    The agent automatically fills the investigation prompt, runs the analysis, and returns the root cause (CPU overload from the stress script).
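
Instead of watching the console during step 2, you could poll the alarm state. This is a minimal sketch, assuming a boto3 CloudWatch client is passed in; the helper name `wait_for_alarm`, the default timings, and the alarm name in the usage comment are hypothetical and not taken from the demo template.

```python
import time
from typing import Any, Callable


def wait_for_alarm(
    client: Any,
    alarm_name: str,
    timeout_s: float = 900,
    interval_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the alarm reaches the ALARM state.

    Returns True once the alarm fires, or False if `timeout_s`
    seconds of waiting pass without it firing.
    """
    waited = 0.0
    while True:
        resp = client.describe_alarms(AlarmNames=[alarm_name])
        states = [a["StateValue"] for a in resp.get("MetricAlarms", [])]
        if "ALARM" in states:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Usage against a real account (assumes boto3 and credentials are set up;
# "cpu-stress-alarm" is a made-up alarm name):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
#   fired = wait_for_alarm(cloudwatch, "cpu-stress-alarm")
```

Once it returns True, the alarm should be visible under Incident Response in the web app.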

Observations

  • When two alarms occur within ~40 minutes, the agent sometimes fails to pinpoint the root cause and must be rerun.
  • For user‑initiated incidents, the agent may omit a mitigation plan because no automated remediation is applicable.
  • The agent highlights investigation gaps (e.g., missing SSH access or CloudWatch log groups) to explain why certain details are unavailable.

You can ask follow‑up questions in natural language via the chat interface.

Demo 2: Investigate an EKS Pod Error

Prerequisites

  • Terraform code that provisions an EKS cluster and an intentionally failing NGINX pod (ImagePullBackOff).
  • The IAM role ARN of the Agent Space (found under View role permissions in the Management Console).

Steps

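  1. Grant the Agent Space role access to the EKS cluster by creating an access entry associated with the AmazonEKSAdminViewPolicy access policy (the cluster's authentication mode must allow access entries).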

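    # terraform/eks_role.tf
    variable "agent_role_arn" {
      type = string # Agent Space role ARN
    }
    # Register the role with the cluster (cluster resource name is assumed)
    resource "aws_eks_access_entry" "agent" {
      cluster_name  = aws_eks_cluster.this.name
      principal_arn = var.agent_role_arn
    }
    # AmazonEKSAdminViewPolicy is an EKS access policy, not an IAM managed policy
    resource "aws_eks_access_policy_association" "agent_view" {
      cluster_name  = aws_eks_access_entry.agent.cluster_name
      principal_arn = aws_eks_access_entry.agent.principal_arn
      policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy"
      access_scope { type = "cluster" }
    }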
  2. Update terraform.tfvars with the copied ARN and apply the Terraform configuration.

  3. Tag the newly created resources with the same tag key/value pairs used by the Agent Space (e.g., DevOps=Demo). This enables the agent to discover them.

  4. In the Agent web app, navigate to Capabilities → Edit and verify that the tags are recognized.

  5. Ask the agent: “What is causing the NGINX pod ImagePullBackOff error?”

    The agent discovers the EKS resources, identifies the missing image pull secret, and returns:

    • Root cause – missing image pull secret.
    • Mitigation steps – create the secret and attach it to the service account.
    • Rollback guidance – how to revert the change if the mitigation introduces new issues.
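
To cross-check the agent's finding by hand, you can inspect the pod's status yourself. The sketch below parses pod JSON, as produced by `kubectl get pod <pod-name> -o json`, and reports containers stuck on an image pull. The helper name `image_pull_failures` is illustrative, and this is a manual check, not a description of how the agent works internally.

```python
import json

# Waiting reasons the kubelet reports when an image cannot be pulled
IMAGE_PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def image_pull_failures(pod_json: str) -> list[dict]:
    """Return one entry per container that is waiting on an image pull."""
    pod = json.loads(pod_json)
    failures = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A container's state holds exactly one of waiting/running/terminated
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            failures.append({
                "container": status.get("name"),
                "image": status.get("image"),
                "reason": waiting["reason"],
                "message": waiting.get("message", ""),
            })
    return failures
```

The `message` field usually carries the registry's error text, which helps tell a missing pull secret apart from a mistyped image name.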

Takeaways

  • By returning a specific root cause and actionable remediation steps, the agent can shorten Mean Time To Recovery (MTTR).
  • It can suggest prevention recommendations (e.g., enforce image pull secret policies) to avoid similar incidents.

DevOps Engineer Perspective

  • Security – The agent's reach is limited to what admins grant: scoped IAM roles define exactly which AWS resources and external services it can access.
  • Job impact – The agent is an assistant, not a replacement. Engineers are still required to apply fixes, design infrastructure, and build new features.
  • Future potential – Connecting additional data sources (e.g., MCP servers) can further enrich the agent’s context, making investigations even faster and more comprehensive.

The AWS DevOps Agent showcases how AI‑assisted tooling can accelerate incident response while keeping human expertise at the core of remediation.
