AWS DevOps Agent: 10 best practices to get the most out of it

Published: December 29, 2025 at 12:27 PM EST
8 min read
Source: Dev.to

One of the key releases at AWS re:Invent 2025 was the launch of new frontier autonomous agents:

  • AWS DevOps Agent
  • AWS Security Agent
  • Kiro Autonomous Agent

Out of these, the AWS DevOps Agent is set to revolutionize how DevOps and SRE teams work. Below are the essential best practices to get the most out of your AWS DevOps Agent.

1. AWS DevOps Agent is a capability, not a tool

“You read that correctly. The DevOps Agent is not a magic bullet that will solve all your problems while you sip your cup of tea.”

It’s a capability—its results depend on how you use it.

Example

You can’t just install an AIOps agent and expect MTTR (Mean Time to Resolution) to drop automatically. If nothing else changes:

  • Alerts will still fire the same way.
  • Runbooks will still not be executable.
  • Services will still lack clear ownership and defined SLOs (Service Level Objectives).

What you need to do

  1. Define SLOs for each service.
  2. Convert runbooks into executable processes.
  3. Provide observability.
  4. Ensure change visibility.
  5. Enable other capabilities so the agent can:
    • Correlate deployments.
    • Suggest resolutions.
    • Execute with humans in the loop.

Remember: capabilities involve people, processes, and tools, not just software.
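To make step 1 concrete, here is a minimal sketch of an SLO record and an error-budget calculation. The `Slo` class, the service name, the owner, and the numbers are all hypothetical and only illustrate the idea; they are not part of the AWS DevOps Agent itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: which service, who owns it, what it promises."""
    service: str
    owner: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Illustrative numbers only.
checkout = Slo(service="checkout-api", owner="payments-team", target=0.999)
remaining = error_budget_remaining(checkout, total_requests=1_000_000, failed_requests=250)
print(f"{checkout.service} ({checkout.owner}): {remaining:.0%} of error budget remaining")
```

An explicit owner and target per service give both the humans and the agent the same definition of "healthy" to reason against.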

2. Observability is the key – the agent needs context

Observability is as crucial as ever. If you thought you could park the observability discussion, you’re in for a rude shock. The agent needs context to act, and context comes from your telemetry data (metrics, logs, and traces).

How to provide that context

  • Aggregate all telemetry sources.
  • If CloudWatch isn’t your cup of tea, use integrations for top observability tools such as Datadog, Dynatrace, New Relic, and Splunk.

The goal is to let the agent see the blast radius of an incident via telemetry so it can understand the system’s internal state and act with the correct intent.

Example

| Situation | Without full observability | With full observability |
|---|---|---|
| Load balancer sees 5xx errors | Agent only sees the 5xx count → suggests scaling the load balancer or services. | Telemetry shows slow SQL queries and an exhausted RDS connection pool (high CPU). Agent concludes the root cause is the RDS issue, not the ALB. |

Enable agents to understand the blast radius, not just the symptoms. Observability is the foundation for that context.
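One way to think about blast radius is as a walk over a service dependency map. The sketch below assumes a hypothetical `DEPENDS_ON` map (the service names are made up) and finds everything that directly or transitively depends on a failing component. It illustrates the concept and is not how the agent computes it internally.

```python
from collections import deque

# Hypothetical service map: service -> the services it calls directly.
DEPENDS_ON = {
    "alb": ["checkout-service"],
    "checkout-service": ["rds-orders"],
    "reporting-jobs": ["rds-orders"],
    "rds-orders": [],
}


def blast_radius(failing: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the map so we can walk from a dependency up to its callers.
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected: set[str] = set()
    queue = deque([failing])
    while queue:
        for caller in dependents.get(queue.popleft(), []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius("rds-orders", DEPENDS_ON)))
```

In the 5xx example, a map like this is what connects the errors seen at the ALB to the database several hops away.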

3. Define golden signals (latency, error rate, saturation, traffic)

Agents reason better about symptoms (the effects users see) than about raw resource alerts, which often generate noise. The more symptom data the agent has, the better it can act.

  • Instead of alerts like CPU > 80% or Memory > 75%, define thresholds such as:
    • checkout latency P95 > 2 s
    • error rate > 1%
  • Alerts are then triggered by increased latency or rising error rates.

Result
The agent can reason about user experience even when infrastructure metrics appear normal, leading to better detection of end‑user‑impacting issues and more effective root‑cause analysis.
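As a rough illustration, the two thresholds above can be evaluated like this. The function names and sample numbers are assumptions for the sketch; in practice these checks would live in your monitoring tool rather than in hand-written code.

```python
def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def symptom_alerts(
    latencies_s: list[float],
    errors: int,
    requests: int,
    p95_limit_s: float = 2.0,        # checkout latency P95 > 2 s
    error_rate_limit: float = 0.01,  # error rate > 1%
) -> list[str]:
    """Return the symptom-based alerts that should fire for this window."""
    alerts = []
    if latencies_s and p95(latencies_s) > p95_limit_s:
        alerts.append("checkout latency P95 above threshold")
    if requests and errors / requests > error_rate_limit:
        alerts.append("error rate above threshold")
    return alerts


# Illustrative window: 10% of requests are slow, 2% fail.
window = [0.2] * 90 + [3.0] * 10
print(symptom_alerts(window, errors=20, requests=1000))
```

Note that neither check looks at CPU or memory: both fire only when users are actually affected.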

4. Provide actionable guidance – not just wikis

Runbooks that only offer investigation guidance are essentially documentation. To make the agent truly useful, give it executable capabilities.

How to do it

  1. Lambda functions that can:

    • Pull telemetry data.
    • Execute remediation actions.
  2. Step Functions (or other workflow engines) that orchestrate those Lambda functions.

  3. Clearly define preconditions, safe actions, and rollback steps.

Example workflow

| Step | Description |
|---|---|
| Lambda 1 | Identify the root cause (e.g., high SQS backlog). |
| Lambda 2 | Determine the correct recovery action (e.g., restart consumer pods). |
| Lambda 3 | Execute the remediation (restart pods). |
| Step Functions | Orchestrate the three Lambdas, include approval gate and rollback logic. |

During an incident, the agent can:

  • Invoke the Lambdas to fetch queue depth and consumer lag.
  • Analyze failure patterns.
  • Recommend (or, after approval, execute) the Step Functions workflow.

The agent becomes an active operator, not just a passive observer.
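The control flow of that workflow (diagnose, plan, approval gate, execute, rollback) can be sketched in plain Python. This is a local simulation under stated assumptions: `run_remediation` and the stub callables are hypothetical stand-ins for the Lambdas and the Step Functions state machine, not real AWS calls.

```python
from typing import Callable


def run_remediation(
    diagnose: Callable[[], str],
    plan: Callable[[str], str],
    execute: Callable[[str], bool],
    rollback: Callable[[str], None],
    approve: Callable[[str, str], bool],
) -> str:
    """Run diagnose -> plan -> approval gate -> execute, rolling back on failure."""
    cause = diagnose()                 # "Lambda 1": identify the root cause
    action = plan(cause)               # "Lambda 2": pick the recovery action
    if not approve(cause, action):     # human-in-the-loop approval gate
        return "rejected"
    try:
        succeeded = execute(action)    # "Lambda 3": run the remediation
    except Exception:
        succeeded = False
    if not succeeded:
        rollback(action)
        return "rolled_back"
    return "completed"


# Stubbed usage; real steps would invoke your own functions.
audit_log: list[tuple[str, str]] = []
result = run_remediation(
    diagnose=lambda: "high SQS backlog",
    plan=lambda cause: "restart consumer pods",
    execute=lambda action: audit_log.append(("execute", action)) or True,
    rollback=lambda action: audit_log.append(("rollback", action)),
    approve=lambda cause, action: True,
)
print(result, audit_log)
```

Keeping the approval gate and rollback in the orchestration layer, rather than inside each step, means every remediation gets them by default.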

5. Treat the agent like a human – focus on guardrails, not blanket permissions

Giving the agent full administrative access is as bad as denying it the permissions it actually needs. The agent requires a reasonable level of access to do its job.

  • Least‑privilege IAM roles are still important.
  • Guardrails: Clearly define what the agent can and cannot do.
    • Broad access for diagnostics (read‑only).
    • Tight controls for remediation actions (write/execute).

With agents, you need to become comfortable with autonomy that operates within well‑defined rails, rather than trying to block every possible action.
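A guardrail of this shape can be expressed as a small decision function. The action names and the three outcomes below are assumptions chosen for illustration; in a real setup the same split would be enforced with IAM policies and approval steps rather than application code.

```python
# Diagnostics: read-only actions are broadly allowed.
READ_ONLY_PREFIXES = ("Describe", "Get", "List")

# Remediation: hypothetical write actions that are allowed only with approval.
APPROVAL_REQUIRED = {"RestartService", "ScaleService"}


def evaluate(action: str) -> str:
    """Classify a requested action as 'allow', 'needs_approval', or 'deny'."""
    if action.startswith(READ_ONLY_PREFIXES):
        return "allow"
    if action in APPROVAL_REQUIRED:
        return "needs_approval"
    return "deny"  # anything not explicitly listed stays off the rails


for requested in ("DescribeDBInstances", "RestartService", "DeleteDBInstance"):
    print(requested, "->", evaluate(requested))
```

The default-deny branch is the important design choice: new or unexpected actions need an explicit decision before the agent can use them.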

TL;DR Checklist (practices 1–5)

  • ☐ Treat the DevOps Agent as a capability (people + process + tools).
  • ☐ Provide full observability (metrics, logs, traces) from all sources.
  • ☐ Define golden signals and use symptom‑based alerts.
  • ☐ Replace static runbooks with executable Lambda/Step Functions workflows.
  • ☐ Apply guardrails: broad read‑only access for diagnostics, tightly controlled remediation.

Follow these practices, and you’ll unlock the full potential of the AWS DevOps Agent—turning it from a recommendation engine into an autonomous, context‑aware operator for your cloud environment.

6. Have a Knowledge Transfer (KT) Plan for the Agent – Your Team Member Needs Some Babysitting

Treat the DevOps agent as a new team member. It may be a superhero when it comes to AWS, but it’s still a novice when it comes to your specific cloud implementation.

  • Train the agent with detailed information so it can develop a full understanding of your architecture, implementations, and business context.
  • Think of it as an expert Solution Architect who has just joined the team—don’t assume prior knowledge.
  • Share everything you have and onboard it properly, rather than letting it jump straight into firefighting.

What to provide:

  • Architecture diagrams
  • Documentation
  • Service mappings
  • Business context
  • Known failure patterns

This enables the agent to prioritize a payment API over reporting jobs when managing alerts and to avoid repeating known bad remediations.

Key takeaway: Context reduces incorrect automation actions.
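Here is a small sketch of how that context changes decisions. The `SERVICE_CATALOG` structure, tier numbers, and remediation names are hypothetical; the point is only that priority and known failure patterns have to be written down somewhere the agent can read them.

```python
# Hypothetical service catalog: lower tier number = more business-critical.
SERVICE_CATALOG = {
    "payment-api": {
        "tier": 1,
        "owner": "payments-team",
        "known_bad_remediations": {"restart-db"},
    },
    "reporting-jobs": {
        "tier": 3,
        "owner": "analytics-team",
        "known_bad_remediations": set(),
    },
}

UNKNOWN_SERVICE = {"tier": 99, "known_bad_remediations": set()}


def triage(alerts: list[dict]) -> list[dict]:
    """Order alerts so the most business-critical services come first."""
    return sorted(
        alerts,
        key=lambda a: SERVICE_CATALOG.get(a["service"], UNKNOWN_SERVICE)["tier"],
    )


def is_safe(service: str, remediation: str) -> bool:
    """Reject remediations already known to go badly for this service."""
    entry = SERVICE_CATALOG.get(service, UNKNOWN_SERVICE)
    return remediation not in entry["known_bad_remediations"]


alerts = [{"service": "reporting-jobs"}, {"service": "payment-api"}]
print([a["service"] for a in triage(alerts)])
print(is_safe("payment-api", "restart-db"))
```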

7. Let Agents Know What Your Developers Are Doing

Yes, it’s a DevOps agent—but it still needs visibility into what your developers are working on. Connect your CI/CD pipelines and provide this visibility to the agent.

  • The agent can correlate operational issues with recent code changes and deployments.
  • It can identify specific commits or pipeline executions and isolate them to better understand root causes.

Reality check: Most incidents today are code‑related or deployment‑related. The old saying still holds true—if you don’t touch it, it won’t break on its own.

How this helps

  • Accelerates the agent’s ability to isolate root causes.
  • Reduces Mean Time to Resolution (MTTR).

Example

  1. A latency spike occurs at a certain time.
  2. The DevOps agent checks the CI/CD pipeline and identifies a deployment that happened shortly before the spike.
  3. The commit included changes to payment‑related files.
  4. The agent pulls additional metrics, finds that they line up with the deployment, and concludes with high confidence that the recent deployment caused the alert. It recommends a rollback.

Without CI/CD context, the agent would waste time investigating infrastructure issues, increasing MTTR.
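The correlation in steps 1 and 2 boils down to a time-window lookup. A minimal sketch, assuming deployment records with a `finished_at` timestamp and a 30-minute window (both are illustrative choices, not values from any AWS API):

```python
from datetime import datetime, timedelta


def suspect_deployments(
    deployments: list[dict],
    spike_at: datetime,
    window: timedelta = timedelta(minutes=30),
) -> list[dict]:
    """Return deployments that finished within `window` before the spike, newest first."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= spike_at - d["finished_at"] <= window
    ]
    return sorted(candidates, key=lambda d: d["finished_at"], reverse=True)


# Hypothetical pipeline history.
spike = datetime(2025, 1, 1, 12, 0)
history = [
    {"id": "deploy-1", "finished_at": datetime(2025, 1, 1, 9, 0)},
    {"id": "deploy-2", "finished_at": datetime(2025, 1, 1, 11, 50)},
    {"id": "deploy-3", "finished_at": datetime(2025, 1, 1, 12, 10)},
]
print([d["id"] for d in suspect_deployments(history, spike)])
```

Deployments that finished after the spike are excluded on purpose, since they cannot have caused it.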

8. Hold Your Agent’s Hand Until It Grows Up – Start with a Human‑in‑the‑Loop

Initially you need to be heavily involved—you can’t realistically expect a fully autonomous agent from day one.

  • Observe its behavior, explain context, and provide detailed recommendations that the agent can act on.
  • All remediation actions should go through an approval process at the beginning.

Building Trust

  • Gradually increase autonomy by putting the right guardrails in place.
  • Use chat features to provide details, discuss failure scenarios, and plan responses in real time.
  • If you notice false alarms or incorrect root‑cause analysis, correct the agent and explain why you disagree so it can learn effectively.

Goal: Proactively take action to ensure the agent succeeds, rather than waiting for it to fail.

Example:

  • The agent recommends restarting an RDS instance.
  • A human rejects the action and explains that an RDS restart could cause data loss or customer impact during peak hours.
  • The agent learns about time windows, business constraints, and safer alternatives.

In later phases, the agent can automatically restart stateless services, while still requiring approval for any data‑layer changes. Trust is built through guided autonomy.
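One possible way to encode "guided autonomy" is to track human decisions per action and relax the approval requirement only after a streak of approvals. The `TrustLedger` class and the promotion threshold are hypothetical policy choices for this sketch, not a feature of the agent.

```python
from collections import defaultdict


class TrustLedger:
    """Track human decisions per action and decide when approval can be skipped."""

    def __init__(self, promote_after: int = 5):
        self.promote_after = promote_after
        self.streak: dict[str, int] = defaultdict(int)  # consecutive approvals

    def record(self, action: str, approved: bool) -> None:
        """A rejection resets the streak, so trust has to be earned again."""
        self.streak[action] = self.streak[action] + 1 if approved else 0

    def needs_approval(self, action: str, layer: str) -> bool:
        if layer == "data":  # data-layer changes always go to a human
            return True
        return self.streak[action] < self.promote_after


ledger = TrustLedger(promote_after=3)
for _ in range(3):
    ledger.record("restart-web-service", approved=True)

print(ledger.needs_approval("restart-web-service", layer="stateless"))
print(ledger.needs_approval("restart-rds-instance", layer="data"))
```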

9. Measure Agent Performance Using Business Metrics

An agent is not a shiny object that you deploy and forget about. It’s useless if it doesn’t improve outcomes.

Key Metrics to Track

  • Mean Time to Resolution (MTTR)
  • Noise reduction (e.g., alerts per incident)
  • Percentage of root causes identified automatically
  • Percentage of remediations executed by the agent

These metrics help you understand whether the agent is delivering real value.

Without measurement, there will be no meaningful improvement.

Example

| Metric | Before Agent | After Agent |
|---|---|---|
| MTTR | 45 minutes | 18 minutes |
| Alerts per incident | 120 | 35 |
| Auto‑diagnosed incidents | 0 % | 40 % |
| Auto‑remediated incidents | 0 % | 20 % |

These are the real business benefits you should strive to achieve. If you can’t demonstrate measurable impact, the agent is just a shiny demo.
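If your incident tracker can export records, these metrics take only a few lines to compute. The record fields below (`minutes_to_resolve`, `alerts`, `auto_diagnosed`, `auto_remediated`) are assumed names for the sketch, and the sample data is made up.

```python
from statistics import mean


def summarize(incidents: list[dict]) -> dict[str, float]:
    """Compute agent performance metrics from a non-empty list of incident records."""
    total = len(incidents)
    return {
        "mttr_minutes": mean(i["minutes_to_resolve"] for i in incidents),
        "alerts_per_incident": mean(i["alerts"] for i in incidents),
        "auto_diagnosed_pct": 100 * sum(i["auto_diagnosed"] for i in incidents) / total,
        "auto_remediated_pct": 100 * sum(i["auto_remediated"] for i in incidents) / total,
    }


# Illustrative records only.
incidents = [
    {"minutes_to_resolve": 10, "alerts": 30, "auto_diagnosed": True, "auto_remediated": True},
    {"minutes_to_resolve": 30, "alerts": 50, "auto_diagnosed": True, "auto_remediated": False},
]
print(summarize(incidents))
```

Run the same summary over the period before the agent was introduced to get a baseline for the comparison.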

10. Actively Look Into Agent Investigation Gaps and Work to Resolve Them

A DevOps agent will not be right on the first attempt, especially in the early stages.

  • Continuously review investigation gaps (e.g., missed root causes, false positives).
  • Iteratively improve the agent’s knowledge base, guardrails, and automation scripts.

By systematically addressing gaps, you ensure the agent becomes more reliable and valuable over time.

Investigation Gaps

There will be many investigations the agent cannot continue due to:

  • Implementation gaps
  • Missing context
  • Lack of telemetry data
  • Missing capabilities
  • Permission issues

You need to regularly review these investigation gaps and provide the necessary inputs to the agent. Over time, this makes the agent more effective.
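A simple tally of why investigations stalled shows where to invest first. This sketch assumes each stalled investigation is recorded with a `gap` label matching the categories above; the record format is hypothetical.

```python
from collections import Counter


def top_gaps(investigations: list[dict], limit: int = 3) -> list[tuple[str, int]]:
    """Count why stalled investigations stopped, most common reason first."""
    reasons = Counter(i["gap"] for i in investigations if i.get("gap"))
    return reasons.most_common(limit)


# Illustrative review data; completed investigations have no gap.
reviewed = [
    {"id": 1, "gap": "lack of telemetry data"},
    {"id": 2, "gap": "permission issues"},
    {"id": 3, "gap": "lack of telemetry data"},
    {"id": 4, "gap": None},
    {"id": 5, "gap": "lack of telemetry data"},
    {"id": 6, "gap": "permission issues"},
    {"id": 7, "gap": "missing context"},
]
print(top_gaps(reviewed))
```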

Example

Scenario:
The agent stops investigating and reports that it is unable to determine the root cause because database query metrics are missing.

Your response:

  1. Enable RDS Performance Insights.
  2. Add slow query logs.
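  3. Create a Lambda function to fetch query statistics.

# Example Lambda function: list RDS instances as a starting point for query stats
import boto3

def handler(event, context):
    client = boto3.client('rds')
    response = client.describe_db_instances()
    # Return only JSON-serializable fields: the raw response contains
    # datetime objects, which Lambda cannot serialize.
    # Add logic here to retrieve and process Performance Insights data.
    return [
        {
            'id': db['DBInstanceIdentifier'],
            'status': db['DBInstanceStatus'],
            'performance_insights': db.get('PerformanceInsightsEnabled', False),
        }
        for db in response['DBInstances']
    ]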

Result:
With this additional context, the agent can identify long‑running queries and suggest actions such as:

  • Index creation
  • Query throttling

Key Takeaways

  • Every failure is a training data point for your agent—not a reason to abandon it or point fingers.
  • Continuously evolve with the AWS DevOps agent and take it on the journey toward greater automation and insight.