CloudWatch Investigations: Your AI-Powered Troubleshooting Sidekick
Source: Dev.to
Remember those 3 AM incidents when you’re frantically switching between dashboards, digging through logs, and wondering if you should just restart everything?
We’ve all spent non‑business hours, weekends, and midnights troubleshooting production issues – an energy‑draining task.
What if, in this GenAI world, we had an AI assistant that works 24 × 7 and guides us through the chaos?
Enter CloudWatch Investigations, a generative‑AI‑powered feature that’s changing how we handle incidents in AWS environments.
How It Works
When something breaks, instead of you jumping between CloudWatch metrics, logs, deployment history, CloudTrail, X‑Ray, and health dashboards, CloudWatch Investigations does the first round of detective work for you.
It uses generative AI to scan your system’s telemetry and quickly surface:
- Suspicious metrics
- Relevant logs
- Recent deployments or configuration changes
- Possible root‑cause hypotheses (especially when multiple resources are involved)
All of this is presented visually, so you can see how things are connected instead of guessing. It’s like having an extra team member who’s been staring at your system architecture 24/7.
Getting Started
- Open the console – In the AWS Console go to CloudWatch → AI Operations (left pane).
- First‑time setup – If this is the first time you’re configuring the account, you’ll be prompted to set up an Investigation Group.
Create an Investigation Group
| Setting | Description |
|---|---|
| Retention days | How long investigations are kept. Note: the retention period cannot be changed after it’s set. |
| Customise encryption | Use a customer‑managed KMS key for encryption. Make sure the required permissions are granted to the key. |
| IAM role | CloudWatch Investigations creates a role with the required read‑only permissions. You can also create a custom role. By default it attaches: • AIOpsAssistantPolicy • AmazonRDSPerformanceInsightsFullAccess • AIOpsAssistantIncidentReportPolicy |
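If you choose a customer‑managed KMS key, the key policy must allow the investigations service to use it. The statement below is a sketch of the kind of grant required – the `aiops.amazonaws.com` service principal and the exact action list are my assumptions, so confirm them against the current CloudWatch Investigations documentation before applying:

```json
{
  "Sid": "AllowCloudWatchInvestigationsUseOfTheKey",
  "Effect": "Allow",
  "Principal": { "Service": "aiops.amazonaws.com" },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
  "Resource": "*"
}
```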
Once the group is created you’ll see Optional Enhanced Configuration options.

Enhanced Integration Options
- Application tags – Include tags related to your application to help CloudWatch narrow down investigations.
- CloudTrail access – Enables the service to discover relevant change events.
- Optional data sources – X‑Ray, Application Signals, and EKS access entries.
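The console flow above is what this post follows, but the same group can be created programmatically through the AIOps API. The sketch below only assembles the request without sending it – every field name here is an assumption about the `CreateInvestigationGroup` operation at the time of writing, so verify against the current boto3 `aiops` client reference before use:

```python
# Hedged sketch: assemble (but don't send) a CreateInvestigationGroup request.
# All field names are assumptions about the AIOps API -- verify them against
# the current AWS documentation. Names and ARNs are hypothetical placeholders.
request = {
    "name": "prod-investigations",
    "roleArn": "arn:aws:iam::123456789012:role/AIOpsAssistantRole",
    "retentionInDays": 90,  # remember: retention cannot be changed later
    "tagKeyBoundaries": ["application"],  # scope investigations by app tags
    "isCloudTrailEventHistoryEnabled": True,  # surface change events
}
# To actually create the group (requires credentials and the aiops client):
# boto3.client("aiops").create_investigation_group(**request)
print(request["name"])
```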
Demo
The Sample Application
For this demo we use a simple Event‑Booking app:

- User flow – Users book appointments by providing details and selecting an available slot.
- Admin flow – Admins approve or reject requests.


Introducing a Failure
- Modify Lambda role – Remove the KMS permission from the Lambda execution role.
- Simulated outage – Users start seeing errors when trying to view slots, and the admin cannot see any appointments.
- Initial investigation – The entry point for the app is CloudFront. Checking CloudFront shows a spike in 5xx errors.
(Continue the investigation using CloudWatch Investigations – it will automatically surface the missing KMS permission, point out the affected Lambda, and suggest remediation steps.)
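If you want to reproduce this demo yourself, the two steps – break the role, then confirm the symptom at CloudFront – can be sketched as below. The role, policy, and distribution names are hypothetical placeholders, and the live AWS calls are left commented out:

```python
# Step 1: reproduce the failure by stripping the KMS inline policy from the
# Lambda execution role (role/policy names are hypothetical placeholders).
role_name = "eventapp-lambda-role"
policy_name = "kms-access"
# boto3.client("iam").delete_role_policy(RoleName=role_name, PolicyName=policy_name)

# Step 2: confirm the symptom at the entry point. CloudFront publishes the
# 5xxErrorRate metric in the AWS/CloudFront namespace, with DistributionId
# and Region=Global dimensions.
metric_query = {
    "Namespace": "AWS/CloudFront",
    "MetricName": "5xxErrorRate",
    "Dimensions": [
        {"Name": "DistributionId", "Value": "E1234567890ABC"},  # hypothetical
        {"Name": "Region", "Value": "Global"},
    ],
    "Period": 300,
    "Statistics": ["Average"],
}
# boto3.client("cloudwatch").get_metric_statistics(
#     StartTime=start, EndTime=end, **metric_query)
print(metric_query["MetricName"])
```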
Starting Investigation
- Under the CloudWatch 5xx metric, you can start an investigation to find out why the service is returning 5xx errors. The view automatically picks the most recent timestamp, but you can adjust the start time if you wish.
- Once the investigation is started, it takes 10–15 minutes to finish. You can watch the progress, or use that time to communicate with users and the business or to start other parallel activities.
- When the investigation completes, it clearly shows what went wrong and why we are getting 5xx errors 🥳🥳🥳

Under Root Cause Summary the issue was identified as an IAM configuration problem.

Analysis – This failure pattern represents an IAM configuration issue rather than a service degradation, as evidenced by the specific KMS permission errors and the “NEW” occurrence pattern indicating a recent permission change affecting the eventap staging service components.

Recap
- CloudWatch Investigations leverages generative AI to turn raw telemetry into actionable insights.
- It reduces mean‑time‑to‑detect (MTTD) and mean‑time‑to‑resolve (MTTR) by automating the first detective steps.
- The feature integrates with existing AWS services (CloudWatch, CloudTrail, X‑Ray, etc.) and can be fine‑tuned with tags and optional data sources.

Give it a try the next time you’re pulled into a 3 AM incident – let the AI do the heavy lifting while you focus on fixing the root cause.
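As a closing illustration, the Analysis verdict above – IAM configuration issue versus service degradation – boils down to pattern‑matching on error signals. Here is a toy heuristic of my own making, not the service's actual logic:

```python
# Toy heuristic (illustrative only, NOT the service's real classifier):
# decide whether an error message looks like an IAM/permission problem
# or more like general service degradation.
PERMISSION_MARKERS = ("AccessDenied", "kms:", "is not authorized to perform")

def classify(error_message: str) -> str:
    if any(marker in error_message for marker in PERMISSION_MARKERS):
        return "iam-configuration"
    return "service-degradation"

print(classify(
    "User: arn:aws:sts::123456789012:assumed-role/eventapp-lambda-role "
    "is not authorized to perform: kms:Decrypt"))      # iam-configuration
print(classify("Task timed out after 3.00 seconds"))   # service-degradation
```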