CloudWatch Investigations: Your AI-Powered Troubleshooting Sidekick
Source: Dev.to
Remember those 3 AM incidents when you’re frantically switching between dashboards, digging through logs, and wondering if you should just restart everything?
We’ve all spent non‑business hours, weekends, and midnights troubleshooting production issues – an energy‑draining task.
What if, in this GenAI world, we had an AI assistant that works 24 × 7 and guides us through the chaos?
Enter CloudWatch Investigations, a generative‑AI‑powered feature that’s changing how we handle incidents in AWS environments.
How It Works
When something breaks, instead of you jumping between CloudWatch metrics, logs, deployment history, CloudTrail, X‑Ray, and health dashboards, CloudWatch Investigations does the first round of detective work for you.
It uses generative AI to scan your system’s telemetry and quickly surface:
- Suspicious metrics
- Relevant logs
- Recent deployments or configuration changes
- Possible root‑cause hypotheses (especially when multiple resources are involved)
All of this is presented visually, so you can see how things are connected instead of guessing. It’s like having an extra team member who’s been staring at your system architecture 24/7.
Getting Started
- Open the console – In the AWS Console go to CloudWatch → AI Operations (left pane).
- First‑time setup – If this is the first time you’re configuring the account, you’ll be prompted to set up an Investigation Group.
Create an Investigation Group
| Setting | Description |
|---|---|
| Retention days | How long investigations are kept. Note: the retention period cannot be changed after it’s set. |
| Customise encryption | Use a customer‑managed KMS key for encryption. Make sure the required permissions are granted to the key. |
| IAM role | CloudWatch Investigations creates a role with the required read‑only permissions. You can also create a custom role. By default it attaches: • AIOpsAssistantPolicy • AmazonRDSPerformanceInsightsFullAccess • AIOpsAssistantIncidentReportPolicy |
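If you choose a customer‑managed KMS key, the key policy must allow the investigations service to use it. The statement below is a sketch of the kind of grant required – the `aiops.amazonaws.com` service principal and the exact action list are my assumptions, so confirm them against the current CloudWatch Investigations documentation before applying:

```json
{
  "Sid": "AllowCloudWatchInvestigationsUseOfTheKey",
  "Effect": "Allow",
  "Principal": { "Service": "aiops.amazonaws.com" },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
  "Resource": "*"
}
```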
Once the group is created you’ll see Optional Enhanced Configuration options.

Enhanced Integration Options
- Application tags – Include tags related to your application to help CloudWatch narrow down investigations.
- CloudTrail access – Enables the service to discover relevant change events.
- Optional data sources – X‑Ray, Application Signals, and EKS access entries.
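The console flow above is what this post follows, but the same group can be created programmatically through the AIOps API. The sketch below only assembles the request without sending it – every field name here is an assumption about the `CreateInvestigationGroup` operation at the time of writing, so verify against the current boto3 `aiops` client reference before use:

```python
# Hedged sketch: assemble (but don't send) a CreateInvestigationGroup request.
# All field names are assumptions about the AIOps API -- verify them against
# the current AWS documentation. Names and ARNs are hypothetical placeholders.
request = {
    "name": "prod-investigations",
    "roleArn": "arn:aws:iam::123456789012:role/AIOpsAssistantRole",
    "retentionInDays": 90,  # remember: retention cannot be changed later
    "tagKeyBoundaries": ["application"],  # scope investigations by app tags
    "isCloudTrailEventHistoryEnabled": True,  # surface change events
}
# To actually create the group (requires credentials and the aiops client):
# boto3.client("aiops").create_investigation_group(**request)
print(request["name"])
```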
Demo
The Sample Application
For this demo we use a simple Event‑Booking app:

- User flow – Users book appointments by providing details and selecting an available slot.
- Admin flow – Admins approve or reject requests.


Introducing a Failure
- Modify Lambda role – Remove the KMS permission from the Lambda execution role.
- Simulated outage – Users start seeing errors when trying to view slots, and the admin cannot see any appointments.
- Initial investigation – The entry point for the app is CloudFront. Checking CloudFront shows a spike in 5xx errors.
(Continue the investigation using CloudWatch Investigations – it will automatically surface the missing KMS permission, point out the affected Lambda, and suggest remediation steps.)
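If you want to reproduce this demo yourself, the two steps – break the role, then confirm the symptom at CloudFront – can be sketched as below. The role, policy, and distribution names are hypothetical placeholders, and the live AWS calls are left commented out:

```python
# Step 1: reproduce the failure by stripping the KMS inline policy from the
# Lambda execution role (role/policy names are hypothetical placeholders).
role_name = "eventapp-lambda-role"
policy_name = "kms-access"
# boto3.client("iam").delete_role_policy(RoleName=role_name, PolicyName=policy_name)

# Step 2: confirm the symptom at the entry point. CloudFront publishes the
# 5xxErrorRate metric in the AWS/CloudFront namespace, with DistributionId
# and Region=Global dimensions.
metric_query = {
    "Namespace": "AWS/CloudFront",
    "MetricName": "5xxErrorRate",
    "Dimensions": [
        {"Name": "DistributionId", "Value": "E1234567890ABC"},  # hypothetical
        {"Name": "Region", "Value": "Global"},
    ],
    "Period": 300,
    "Statistics": ["Average"],
}
# boto3.client("cloudwatch").get_metric_statistics(
#     StartTime=start, EndTime=end, **metric_query)
print(metric_query["MetricName"])
```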
Starting Investigation
- Under the CloudWatch 5xx metric, you can start an investigation to find out why the service is returning 5xx errors. The view automatically picks the most recent timestamp, but you can adjust the start time if you wish.
- Once the investigation is started, it takes 10–15 minutes to finish. You can watch the progress, or use that time to communicate with users and the business or to start other parallel activities.
- When the investigation completes, it clearly shows what went wrong and why we are getting 5xx errors 🥳🥳🥳

Under Root Cause Summary the issue was identified as an IAM configuration problem.

Analysis – This failure pattern represents an IAM configuration issue rather than a service degradation, as evidenced by the specific KMS permission errors and the “NEW” occurrence pattern indicating a recent permission change affecting the eventap staging service components.

Recap
- CloudWatch Investigations leverages generative AI to turn raw telemetry into actionable insights.
- It reduces mean‑time‑to‑detect (MTTD) and mean‑time‑to‑resolve (MTTR) by automating the first detective steps.
- The feature integrates with existing AWS services (CloudWatch, CloudTrail, X‑Ray, etc.) and can be fine‑tuned with tags and optional data sources.

Give it a try the next time you’re pulled into a 3 AM incident – let the AI do the heavy lifting while you focus on fixing the root cause.
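As a closing illustration, the Analysis verdict above – IAM configuration issue versus service degradation – boils down to pattern‑matching on error signals. Here is a toy heuristic of my own making, not the service's actual logic:

```python
# Toy heuristic (illustrative only, NOT the service's real classifier):
# decide whether an error message looks like an IAM/permission problem
# or more like general service degradation.
PERMISSION_MARKERS = ("AccessDenied", "kms:", "is not authorized to perform")

def classify(error_message: str) -> str:
    if any(marker in error_message for marker in PERMISSION_MARKERS):
        return "iam-configuration"
    return "service-degradation"

print(classify(
    "User: arn:aws:sts::123456789012:assumed-role/eventapp-lambda-role "
    "is not authorized to perform: kms:Decrypt"))      # iam-configuration
print(classify("Task timed out after 3.00 seconds"))   # service-degradation
```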