CloudWatch Investigations: Your AI-Powered Troubleshooting Sidekick

Published: January 4, 2026 at 03:04 PM EST
4 min read
Source: Dev.to

Remember those 3 AM incidents when you’re frantically switching between dashboards, digging through logs, and wondering if you should just restart everything?
We’ve all spent nights and weekends troubleshooting production issues, and it is draining work.
What if, in this GenAI world, we had an AI assistant that works 24 × 7 and guides us through the chaos?

Enter CloudWatch Investigations, a generative‑AI‑powered feature that’s changing how we handle incidents in AWS environments.

How It Works

When something breaks, instead of you jumping between CloudWatch metrics, logs, deployment history, CloudTrail, X‑Ray, and health dashboards, CloudWatch Investigations does the first round of detective work for you.

It uses generative AI to scan your system’s telemetry and quickly surface:

  • Suspicious metrics
  • Relevant logs
  • Recent deployments or configuration changes
  • Possible root‑cause hypotheses (especially when multiple resources are involved)

All of this is presented visually, so you can see how things are connected instead of guessing. It’s like having an extra team member who’s been staring at your system architecture 24 / 7.

Getting Started

  1. Open the console – In the AWS Console go to CloudWatch → AI Operations (left pane).
  2. First‑time setup – If this is the first time you’re configuring the account, you’ll be prompted to set up an Investigation Group.

Create an Investigation Group

  • Retention days – How long investigations are kept. Note: the retention period cannot be changed after it’s set.
  • Customise encryption – Use a customer‑managed KMS key for encryption. Make sure the required permissions are granted to the key.
  • IAM role – CloudWatch Investigations creates a role with the required read‑only permissions. You can also create a custom role. By default it attaches:
      • AIOpsAssistantPolicy
      • AmazonRDSPerformanceInsightsFullAccess
      • AIOpsAssistantIncidentReportPolicy
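
The same settings can also be expressed in code. The following is a minimal sketch, assuming the boto3 `aiops` client and its `create_investigation_group` call with the parameter names shown (`name`, `roleArn`, `retentionInDays`, `encryptionConfiguration`); verify them against the current SDK documentation before relying on them. The helper names, group name, and ARNs are hypothetical and used only for illustration.

```python
from typing import Any, Dict, Optional


def build_investigation_group_request(
    name: str,
    role_arn: str,
    retention_days: int,
    kms_key_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for creating an investigation group.

    Parameter names follow the AIOps API as the author of this sketch
    understands it (an assumption, not taken from the article).
    """
    if retention_days < 1:
        raise ValueError("retention_days must be a positive number of days")

    request: Dict[str, Any] = {
        "name": name,
        "roleArn": role_arn,
        "retentionInDays": retention_days,
    }
    # Only send an encryption block when a customer-managed key is supplied;
    # otherwise the service default applies.
    if kms_key_arn:
        request["encryptionConfiguration"] = {
            "type": "CUSTOMER_MANAGED_KMS_KEY",
            "kmsKeyIdentifier": kms_key_arn,
        }
    return request


def create_investigation_group(request: Dict[str, Any]) -> str:
    """Send the request to AWS and return the new group's ARN."""
    import boto3  # imported lazily so the builder above runs without the SDK

    client = boto3.client("aiops")
    response = client.create_investigation_group(**request)
    return response["arn"]
```

Keeping the request builder separate from the AWS call makes it easy to review the retention value before creating the group, which matters because the article notes that retention cannot be changed afterwards.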

Once the group is created you’ll see Optional Enhanced Configuration options.

[Screenshot: enhanced configuration UI]

Enhanced Integration Options

  • Application tags – Include tags related to your application to help CloudWatch narrow down investigations.
  • CloudTrail access – Enables the service to discover relevant change events.
  • Optional data sources – X‑Ray, Application Signals, and EKS access entries.
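
For teams that script their setup, the tag and CloudTrail options above might be applied as follows. This is a sketch under assumptions: it presumes the boto3 `aiops` client exposes `update_investigation_group` with `identifier`, `tagKeyBoundaries`, and `isCloudTrailEventHistoryEnabled` parameters, which should be checked against the SDK reference. The function names and tag keys are hypothetical.

```python
from typing import Any, Dict, List


def build_enhanced_config_request(
    group_identifier: str,
    application_tag_keys: List[str],
    enable_cloudtrail_history: bool,
) -> Dict[str, Any]:
    """Assemble keyword arguments for updating an investigation group.

    Parameter names are assumed from the AIOps API, not from the article.
    """
    # Drop blank entries and duplicates while keeping the caller's order.
    seen = set()
    tag_keys: List[str] = []
    for key in application_tag_keys:
        cleaned = key.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            tag_keys.append(cleaned)

    return {
        "identifier": group_identifier,
        "tagKeyBoundaries": tag_keys,
        "isCloudTrailEventHistoryEnabled": enable_cloudtrail_history,
    }


def apply_enhanced_config(request: Dict[str, Any]) -> None:
    """Send the update to AWS."""
    import boto3  # imported lazily so the builder above runs without the SDK

    boto3.client("aiops").update_investigation_group(**request)
```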

Demo

The Sample Application

For this demo we use a simple Event‑Booking app:

[Diagram: Event‑booking architecture]

  • User flow – Users book appointments by providing details and selecting an available slot.
  • Admin flow – Admins approve or reject requests.

[Screenshots: user booking UI and admin UI]

Introducing a Failure

  1. Modify Lambda role – Remove the KMS permission from the Lambda execution role.

    [Screenshot: Lambda role without KMS]

  2. Simulated outage – Users start seeing errors when trying to view slots, and the admin cannot see any appointments.

    [Screenshot: user error]

  3. Initial investigation – The entry point for the app is CloudFront. Checking CloudFront shows a spike in 5xx errors.

    [Screenshot: CloudFront 5xx errors]
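
The 5xx check in step 3 can also be scripted. Below is a minimal sketch that queries the CloudFront `5xxErrorRate` metric through the CloudWatch `get_metric_statistics` API and flags datapoints above a threshold. The distribution ID, the 5-minute period, and the 5% threshold are illustrative assumptions, not values from the demo.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def build_5xx_query(
    distribution_id: str, end: datetime, lookback_minutes: int = 60
) -> Dict[str, Any]:
    """Assemble get_metric_statistics arguments for CloudFront 5xx error rate."""
    return {
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
            {"Name": "DistributionId", "Value": distribution_id},
            {"Name": "Region", "Value": "Global"},
        ],
        "StartTime": end - timedelta(minutes=lookback_minutes),
        "EndTime": end,
        "Period": 300,  # 5-minute buckets (assumed granularity)
        "Statistics": ["Average"],
    }


def find_spikes(
    datapoints: List[Dict[str, Any]], threshold_percent: float = 5.0
) -> List[Dict[str, Any]]:
    """Return datapoints at or above the threshold, oldest first."""
    spikes = [dp for dp in datapoints if dp["Average"] >= threshold_percent]
    return sorted(spikes, key=lambda dp: dp["Timestamp"])


def fetch_datapoints(query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run the query; CloudFront metrics are published in us-east-1."""
    import boto3  # imported lazily so the helpers above run without the SDK

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    return cloudwatch.get_metric_statistics(**query)["Datapoints"]
```

A script like this only confirms that errors spiked and when. It says nothing about why, which is the gap CloudWatch Investigations is meant to fill.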

From here, the investigation continues in CloudWatch Investigations, which in this demo surfaced the missing KMS permission, pointed out the affected Lambda, and suggested remediation steps. The walkthrough is in the Starting Investigation section.

Recap

  • CloudWatch Investigations leverages generative AI to turn raw telemetry into actionable insights.
  • It can reduce mean‑time‑to‑detect (MTTD) and mean‑time‑to‑resolve (MTTR) by automating the first detective steps.
  • The feature integrates with existing AWS services (CloudWatch, CloudTrail, X‑Ray, etc.) and can be fine‑tuned with tags and optional data sources.

Give it a try the next time you’re pulled into a 3 AM incident – let the AI do the heavy lifting while you focus on fixing the root cause.

Starting Investigation

  • From the CloudFront 5xx error metric in CloudWatch, you can start an investigation to find out why the service is returning 5xx errors.

    The view will automatically pick the most recent timestamp, but you can adjust the start time if you wish.

  • Once started, the investigation takes 10‑15 minutes to finish. You can watch the progress, or use that time to update users and business stakeholders or start other activities in parallel.

  • When the investigation completes, it clearly shows what went wrong and why we are getting 5xx errors 🥳🥳🥳

    Under Root Cause Summary the issue was identified as an IAM configuration problem.

    Analysis – This failure pattern represents an IAM configuration issue rather than a service degradation, as evidenced by the specific KMS permission errors and the “NEW” occurrence pattern indicating a recent permission change affecting the eventap staging service components.
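
    A finding like this can be cross-checked before changing anything. The sketch below uses the IAM policy simulator (`simulate_principal_policy`) to ask whether the Lambda execution role is allowed the KMS actions it needs. The action list is an assumption, since the article does not say which KMS permission was removed, and the role and key ARNs are placeholders the reader would supply. The simulator evaluates the role's identity-based policies, so treat the result as a hint rather than proof.

```python
from typing import Any, Dict, List

# Assumed set of KMS actions; adjust to what the function actually calls.
KMS_ACTIONS = ["kms:Decrypt", "kms:GenerateDataKey"]


def denied_actions(simulation_response: Dict[str, Any]) -> List[str]:
    """List the simulated actions that were not allowed, sorted by name."""
    return sorted(
        result["EvalActionName"]
        for result in simulation_response.get("EvaluationResults", [])
        if result.get("EvalDecision") != "allowed"
    )


def simulate_kms_access(role_arn: str, key_arn: str) -> Dict[str, Any]:
    """Ask IAM whether the role may perform KMS_ACTIONS on the key."""
    import boto3  # imported lazily so denied_actions runs without the SDK

    iam = boto3.client("iam")
    return iam.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=KMS_ACTIONS,
        ResourceArns=[key_arn],
    )
```

    An empty list from `denied_actions` after restoring the permission is a quick signal that the role change took effect.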
