How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%

Published: (June 18, 2026 at 12:22 AM EDT)
10 min read
Source: Dev.to

Source: Dev.to

Series: AI-Native SRE

Table of Contents

The Problem Every On-Call Engineer Knows What FRIDAY Does Architecture Overview Key Design Decisions The Tool-Use Loop: How FRIDAY Reasons The Training System: Pre-Built Knowledge Handling Edge Cases Results Lessons Learned Try It Yourself

It’s 2:47 AM. Your phone buzzes and it’s a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual: Open PagerDuty → read the alert title Open Datadog → search for the service, find the error spike Open GitHub → check if someone deployed something Cross-reference timestamps between all three tools Form a hypothesis Drill deeper — check affected tenants, error paths, queue depths Write up findings for the team This process takes 15–45 minutes for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages. I asked myself: So I built one. It’s been running in production for months, investigating real incidents on a platform serving 30+ million end users across multiple AWS regions. We call it FRIDAY.

When a PagerDuty alert fires, FRIDAY: Receives the webhook in real-time via API Gateway Locks the target region from the alert metadata (never investigates the wrong region) Checks GitHub first — finds what changed before the alert (deployments, config changes, PRs) Queries Datadog — error rates, affected tenants, application exceptions, queue depths Synthesizes findings — correlates code changes with observability signals Delivers a structured report to Microsoft Teams as an Adaptive Card The entire investigation takes under 2 minutes. The on-call engineer wakes up to a complete analysis instead of a raw alert.

┌──────────────┐ ┌────────────────┐ ┌─────────────────────┐ │ PagerDuty │────▶│ API Gateway │────▶│ Lambda (Sync) │ │ Webhook │ │ (Validate) │ │ Parse + Self-Invoke│ └──────────────┘ └────────────────┘ └─────────┬───────────┘ │ Async ▼ ┌─────────────────────┐ │ Lambda (Async) │ │ Investigation Agent │ │ │ │ ┌────────────────┐ │ │ │ Amazon Bedrock │ │ │ │ Claude Opus │ │ │ │ (Tool-Use Loop) │ │ │ └───────┬────────┘ │ │ │ │ │ ┌─────┼─────┐ │ │ ▼ ▼ ▼ │ │ GitHub Datadog S3 │ └─────────┬───────────┘ │ ▼ ┌─────────────────────┐ │ Microsoft Teams │ │ (Adaptive Card) │ └─────────────────────┘

  1. Two-Lambda Architecture (Sync + Async)

API Gateway has a 30-second hard timeout. A thorough AI investigation takes 60–180 seconds. The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously returning 200 OK to PagerDuty within 2 seconds.

Sync handler: validate, parse, self-invoke, return immediately

lambda_client.invoke( FunctionName=context.function_name, InvocationType=“Event”, # Fire and forget Payload=json.dumps({ “_async_investigate”: True, “alert_payload”: alert_payload, }), ) return {“statusCode”: 200, “body”: “Investigation started”}

The async Lambda runs the full investigation without timeout pressure. This is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, 80%+ of acute incidents are caused by a preceding change: a deployment, a config update, a replica count change, a memory limit modification. FRIDAY is instructed to check GitHub before touching Datadog: MANDATORY FIRST STEP — GitHub (Step 0): Before touching Datadog, you MUST run these calls in parallel:

  1. github_search_repos — find the repo for the alerted service
  2. github_list_commits — find commits in the 2 hours before the alert fired

A deployment or config change is the most likely root cause.

Why this matters: When the AI correlates “this PR merged 12 minutes before the error spike” with “5xx errors started at exactly the merge timestamp” — it produces findings that are immediately actionable. This single design decision dramatically improved root cause accuracy. Our platform spans multiple AWS regions. A naive agent querying “all 5xx errors” would mix signals from healthy and unhealthy regions, producing confused analysis. FRIDAY’s first action is always to lock a target region from the alert metadata: 🌍 Region: Description — resolved from alert hostname

Every subsequent Datadog query includes: kube_cluster_name:region-az-* (scoped to affected region only)

This eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region. FRIDAY’s output isn’t freeform text. It follows a strict section contract that the Teams integration parses into visual containers:

EXECUTIVE SUMMARY

[2-3 sentences — what happened, who’s affected, what changed]

KEY FINDINGS

[Bulleted evidence from GitHub + Datadog]

WHAT CHANGED

[Specific commit/PR with timestamp and author]

ERROR BREAKDOWN

[Service-by-service error counts with affected tenants]

ROOT CAUSE

[Confirmed / Suspected / Unknown — with evidence chain]

CUSTOMER IMPACT

[Affected tenants, operations, scope]

[Specific next steps for the on-call engineer]

The on-call engineer can glance at the Teams card and immediately know: what happened, who’s affected, what likely caused it, and what to do next without reading a wall of text.

FRIDAY uses Claude’s tool-use capability in a multi-round loop. The AI doesn’t execute a fixed script — it reasons about each alert independently, deciding which tools to call based on what it’s learned so far. for round_num in range(MAX_TOOL_ROUNDS): # Max 25 rounds response = bedrock_client.converse( modelId=“anthropic.claude-opus”, messages=messages, toolConfig={“tools”: TOOL_DEFINITIONS}, )

if stop_reason == "tool_use":
    # Execute tools, append results, continue reasoning
    for tool_call in content_blocks:
        result = execute_tool(
            tool_call["name"], 
            tool_call["input"]
        )
        tool_results.append(result)
    messages.append(tool_results)

elif stop_reason == "end_turn":
    # AI has concluded — extract findings
    return extract_final_report(content_blocks)

Tool Purpose

github_search_repos Find which repo owns a service

github_list_commits What changed before the alert

github_get_file Read actual deployment configs

github_search_code Find all producers/consumers of a queue

datadog_log_search Find specific error messages

datadog_log_aggregate Count errors by backend/tenant/path

datadog_query_metrics Queue depth, CPU, memory, latency

datadog_get_monitor Understand what threshold triggered

The AI typically uses 8–15 tool calls per investigation, batching parallel calls when possible to minimize round-trip time.

A cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a deterministic training mode that pre-builds architectural knowledge: def train(): """ Deterministic training: ~13 targeted API calls, then one Bedrock synthesis call.

Collects: cluster-service maps, HAProxy backends, 
chronic error baselines, recent planned work.
"""
# Phase 1: Targeted data collection (no AI — pure API calls)
collected = {}
for key, tool_name, tool_input in TRAINING_CALLS:
    collected[key] = execute_tool(tool_name, tool_input)

# Phase 2: Single AI synthesis call
knowledge_doc = synthesize_knowledge(collected)

# Phase 3: Save to S3 — injected into system prompt
save_to_s3(knowledge_doc)

The knowledge document contains: Cluster → Service map — What runs where Chronic error baselines — Background noise to ignore (not incidents) Recent planned work — Deployments and migrations that explain expected errors Backend inventory — Every backend serving traffic Key insight: Knowledge injection > Larger context windows. A synthesized knowledge document — curated, current, and actionable — is more effective than dumping raw infrastructure documentation into the prompt. It captures real state, not aspirational state.

Planned Work vs. Real Incidents

One of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you expect 5xx errors as traffic drains. FRIDAY handles this through: Knowledge injection — Training mode captures recent PRs tagged as planned work Real-time PR correlation — During investigation, it reads PR bodies for keywords like “decommission”, “drain”, “planned” Explicit classification — If a 5xx spike coincides with a merged “failover” PR, FRIDAY reports: “This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required.” What happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a graceful degradation mechanism: if rounds_remaining <= 3: user_content.append({ “text”: ( “STOP CALLING TOOLS. Write your FINAL report ” “NOW using all data collected so far. Mark ” “uncertain findings as ‘Suspected’ rather ” “than skipping them.” ) })

This ensures every investigation produces a report — even if incomplete — rather than timing out silently. PagerDuty retries webhooks. FRIDAY handles this at two levels: Webhook-level — In-memory cache of webhook IDs (survives Lambda warm starts) Incident-level — S3 marker files prevent re-investigating the same incident

After running in production for several months:

Metric Before FRIDAY After FRIDAY Improvement

Mean Time to First Analysis 15–45 min 90 sec–3 min ~90% faster

MTTR (overall) ~60 min ~15 min 65% reduction

AI tool adoption (team) 20% 85% 4x increase

Alert noise (false escalations) High Minimal ~80% reduction

Auto-generated postmortems 0% 100% of P1/P2 Eliminated manual RCA drafts

consistency. A human engineer at 3 AM makes mistakes: investigates the wrong region, misses a recent deployment, forgets to check queue depths. FRIDAY follows the same rigorous methodology every time.

  1. Prompt Engineering IS Architecture

The system prompt is the most important file in the codebase. It’s not instructions — it’s the agent’s operating manual. Ours is ~5,000 words covering: Environment topology (region mappings, cluster roles, service dependencies) Investigation methodology (step-by-step procedures) Critical rules (what NOT to do — as important as what to do) Output format contract Invest in your prompt like you invest in your architecture docs. Before this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents. FRIDAY is explicitly told it does NOT take remediation actions. It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it’s a design choice that builds trust. When on-call engineers trust the AI’s analysis, they act on it faster. The two-Lambda pattern (sync for webhook receipt, async for investigation) is essential. Don’t let API Gateway timeouts dictate your AI agent’s investigation depth. We’re extending this pattern to autonomous security remediation — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. Same tool-use architecture, different domain. AI-native: systems designed from the ground up with autonomous agents as first-class participants in the operational loop.

The pattern is reproducible with: Amazon Bedrock (Claude Opus or Sonnet for cost-sensitive use) Any webhook source (PagerDuty, Opsgenie, Datadog) Any observability platform with an API (Datadog, Grafana, New Relic) Any source control (GitHub, GitLab) Any chat platform (Teams, Slack) The hard part isn’t the code — it’s the system prompt. That’s where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful.

What does FRIDAY stand for?

The name also works as a backronym: First Responder for Incident Diagnostics and AnalYsis — but honestly, we just thought the Marvel reference was cooler.

I’m Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations. This is Part 1 of my AI-Native SRE series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates. Connect on LinkedIn

0 views
Back to Blog

Related posts

Read more »

The Model Doesn't Remember. You Do

Introduction Before I dug into how an LLM works, I assumed each chat stored its memory or context in its own. The moment I realized it was just an array with al...