# I Stopped Fighting My Logging Tools and Built an AI Co-Investigator
Source: Dev.to
## TL;DR
I restructured my team’s scattered documentation into an AI‑queryable format, modeled every service’s Splunk log events as TypeScript types, and built an investigation workflow around it. Complex incident investigations went from ~2 hours to ~30 minutes, and the system gets smarter with every investigation archived.
## Who this is for
Backend and platform engineers dealing with on‑call rotations, incident investigation across multiple services, and documentation that never stays current.
## The problem
It’s 2 AM. PagerDuty wakes you up. The alert says something is wrong with a service you haven’t touched in months.
You open:
- Splunk
- New Relic
- Your IDE
- Slack
…and then you open your team’s documentation (Confluence, a wiki, whatever your team uses) and the real challenge starts. It isn’t the tools’ fault; this is a people problem.
- The documentation is scattered across dozens of pages written by different engineers in different eras.
- Half of it is stale. The service was renamed six months ago and nobody updated the docs.
- You’re scanning through walls of text looking for one critical detail while a production incident ticks upward.
This was our reality. I decided to fix it—not by writing better docs, but by rethinking what documentation is for.
## Context
Our backend team ran a suite of Java Spring Boot services and Python Lambda functions on AWS.
- Multiple services, inconsistent logging, complex downstream dependencies spread across a large organisation.
- When incidents happened, engineers were expected to acknowledge within 5 minutes and be investigating within 20 minutes.
But “investigating” usually meant spending the first 20‑30 minutes just gathering context:
- Which service is this?
- Where are the logs?
- What does this log event mean?
- Has this happened before?
The information existed—it was just locked in Confluence pages, wiki entries, old Slack threads, and people’s heads. Not accessible at 2 AM, not under pressure.
## A new approach to documentation
| Traditional documentation | AI‑ready documentation |
|---|---|
| “How do I explain this system to a person?” | “How do I make this system queryable?” |
The answer: have both formats.
- `docs/` – comprehensive, narrative documentation for engineers to read and onboard with.
- `.context/` – dense, structured, AI‑queryable reference material.
### Step 1 – Consolidate everything

1. Pull all existing documentation into a single repository:
   - Confluence pages (converted from HTML to Markdown)
   - GitHub Pages docs
   - README files from key project repos
2. Ask the AI:

   > “What do industry best practices say about documenting projects that span multiple services, codebases, and teams? Let’s restructure what we have to align with those practices.”
The AI identified what was missing, stale, or mis‑categorised. We rebuilt the docs using those insights.
### Step 2 – Service manifests
Using Claude, I scanned each codebase and created service manifests that describe:
- What the service does
- What it talks to / what talks to it
Result:
- Mermaid diagrams of infrastructure topology
- Sequence diagrams of key request flows
- Inventories of cloud resources (DynamoDB tables, Lambda functions, S3 buckets) with associated repositories
The initial pass took about a week; thereafter I added services incrementally as investigations happened, so sprint velocity wasn’t impacted. This is a living effort, not a one‑and‑done project.
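To make the manifests queryable rather than purely narrative, each one can be expressed as structured data. The sketch below shows one plausible shape; the field names and the example service are illustrative, not the exact schema described in the article.

```typescript
// A minimal sketch of a service manifest entry.
// Field names and the example values are hypothetical.
interface ServiceManifest {
  name: string;            // service identifier
  description: string;     // what the service does
  upstream: string[];      // what talks to it
  downstream: string[];    // what it talks to
  resources: {             // associated cloud resources
    dynamoTables?: string[];
    lambdas?: string[];
    s3Buckets?: string[];
  };
  repository: string;      // where the code lives
}

// Hypothetical example entry
const paymentsManifest: ServiceManifest = {
  name: "payments-service",
  description: "Handles card authorisation and settlement",
  upstream: ["checkout-api"],
  downstream: ["ledger-service", "fraud-check-lambda"],
  resources: {
    dynamoTables: ["payments-transactions"],
    lambdas: ["fraud-check-lambda"],
  },
  repository: "github.com/example-org/payments-service",
};
```

A structured entry like this is what lets an AI answer “what talks to payments-service?” directly, and it is also the source the Mermaid topology diagrams can be generated from.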
### Step 3 – Normalise log formats
Each backend service logged to Splunk, but each had its own format.
A customerId might be a top‑level field in one service’s events but buried three levels deep in another.
This tribal knowledge lived only in the heads of experienced engineers.
Solution: generate TypeScript type definitions for each service’s log format.
```typescript
// Example: one service's log event structure
interface ServiceLogEntry {
  timestamp: string;
  level: string;
  service: string;
  event: {
    type: string;
    customerId: string;
    // …additional service-specific fields
  };
}
```

> *“We logged 14 investigations in the first month. By the end of that month, the system was already surfacing relevant past incidents and proven query patterns when new alerts came in.”*
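Once the event shapes are typed, even a tiny helper can paper over the differences between services. The sketch below assumes two simplified, hypothetical event shapes (one with `customerId` at the top level, one with it nested) rather than any real service’s format.

```typescript
// Sketch: normalise customerId extraction across services whose
// events place the field differently. Shapes are illustrative.
interface FlatEvent {
  customerId: string;
}

interface NestedEvent {
  event: { detail: { customer: { id: string } } };
}

type AnyLogEvent = FlatEvent | NestedEvent;

function extractCustomerId(e: AnyLogEvent): string {
  if ("customerId" in e) return e.customerId; // top-level field
  return e.event.detail.customer.id;          // buried three levels deep
}

// Usage:
const flat: FlatEvent = { customerId: "c-123" };
const nested: NestedEvent = { event: { detail: { customer: { id: "c-456" } } } };
```

The same structural knowledge is what the AI consumes: instead of remembering which service nests which field, it reads the type definitions.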
This is where human‑AI collaboration truly shines.
---
## Why Human‑AI Collaboration Matters
| Problem | Human‑Only Approach | AI‑Assisted Approach |
|---------|-------------------|----------------------|
| **Command knowledge** | Most engineers know only 5‑10 of the 140 Splunk commands. | AI knows every command (e.g., `stats`, `timechart`, `transaction`, `eventstats`) and can suggest the most appropriate one. |
| **Cognitive load** | Engineers juggle time windows, services, identifiers, and multiple tabs. | AI handles command selection, field renaming, and nested JSON navigation, letting engineers focus on reasoning. |
| **Query debugging** | Queries often fail; engineers spend minutes‑hours debugging while on calls. | AI writes, validates, and refines queries instantly. |
| **Pattern detection** | Scrolling through millions of events is slow and error‑prone. | AI scans sample results in seconds, spotting anomalies a human might miss. |
| **Knowledge gaps** | Few engineers are both Splunk experts *and* deep domain experts. | Engineers provide context; AI supplies meticulous command knowledge and data‑scanning capability. |
> *“You bring the context and domain knowledge. The AI brings encyclopedic command knowledge and the ability to scan vast structured data. It’s a genuinely great partnership.”*
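To make the division of labour concrete, here is a sketch of the kind of query the AI assembles from the typed log models. The index name, service name, and field path are hypothetical; the SPL pattern (`spath` to pull a nested field, `stats` to aggregate) is standard Splunk.

```typescript
// Sketch: build a Splunk query from the typed log model, so nobody
// has to hand-write nested field paths from memory.
// Index, service, and path values are hypothetical.
function buildErrorCountQuery(opts: {
  index: string;
  service: string;
  customerIdPath: string; // e.g. "event.customerId" or "event.detail.customer.id"
  window: string;         // e.g. "-1h"
}): string {
  return [
    `index=${opts.index} service=${opts.service} level=ERROR earliest=${opts.window}`,
    `spath output=customerId path=${opts.customerIdPath}`,
    `stats count by customerId`,
    `sort -count`,
  ].join(" | ");
}

const query = buildErrorCountQuery({
  index: "app_logs",
  service: "payments-service",
  customerIdPath: "event.detail.customer.id",
  window: "-1h",
});
```

The engineer supplies the context (which service, which time window, which customer); the machine supplies the command syntax and the correct path into the nested JSON.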
---
## Impact Metrics
| Metric | Before | After |
|--------|--------|-------|
| **Investigation time (complex issue)** | ~2 hours | ~30 minutes (≈ 75 % faster) |
| **Investigation time (familiar issue)** | ~45 minutes | ~10 minutes (≈ 78 % faster) |
| **Query sophistication** | Basic, inconsistent | Advanced, consistent patterns |
| **Investigation documentation** | Rarely created | 14 investigations archived in month 1 |
| **Knowledge retention** | Lost when engineers leave | Searchable investigation archive |
---
## Adoption Timeline
1. **Pilot** – Presented to ~20 engineers on the core team.
2. **Roll‑out** – Expanded to ~100 engineers, managers, and product owners across the organization.
3. **Cross‑team interest** – Multiple other teams have requested the same setup for their services.
---
## Frequently Asked Questions
### “Isn’t this just RAG?”
Most RAG implementations dump raw documents into a vector store and hope for the best.
Our value comes from **curated documentation**:
- Structured service manifests
- TypeScript log models
- Splunk reference material
- Archived investigations
*Garbage in, garbage out* applies to RAG just as much as anything else.
---
### “Won’t AI hallucinate bad Splunk queries?”
Because the AI has **structural knowledge of each service’s events**, the generated queries are usually more accurate than a human’s memory‑based attempts.
You still **validate** every query by running it and checking the results. Those validations feed back into the knowledge base, improving future accuracy.
---
### “My org won’t approve AI tools for production data.”
- **Documentation** about production data can be stored with a stricter security posture.
- Even without AI, the **restructured documentation** and **TypeScript log models** are valuable for onboarding, knowledge sharing, and maintaining accurate docs.
**Start with documentation restructuring**: version‑controlled, structured docs are far superior to scattered wiki pages. Once that foundation exists, adding an AI layer becomes much safer and more effective.
---
## Foundational Pieces
1. **Documentation Restructuring**
- Move from ad‑hoc wiki pages to version‑controlled, structured docs.
- When a service is renamed, AI can scan the entire doc set and update every reference—including misspellings, abbreviations, and contextual mentions that a simple find‑and‑replace would miss.
2. **TypeScript Log Models** *(highest leverage)*
- Model your log event structures in TypeScript.
- Captures massive tribal knowledge and provides a single source of truth for both humans and AI.
3. **Investigation Archival**
- The first investigation is the hardest. By the 14th, the system already suggests relevant past incidents and proven query patterns.
- Creates compound returns: each new archive improves the next investigation.
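The archive only compounds if each investigation is captured in a consistent, searchable shape. The record below is a sketch of one plausible schema, with a naive lookup that surfaces past incidents matching a new alert; all field names and example values are hypothetical.

```typescript
// Sketch of an archived-investigation record and a naive lookup
// that surfaces related past incidents. Fields are illustrative.
interface InvestigationRecord {
  id: string;
  date: string;
  services: string[];   // services involved
  symptoms: string[];   // e.g. "5xx spike", "DynamoDB throttling"
  queries: string[];    // Splunk queries that proved useful
  rootCause: string;
  resolution: string;
}

function findRelated(
  archive: InvestigationRecord[],
  alert: { service: string; symptom: string }
): InvestigationRecord[] {
  return archive.filter(
    (r) =>
      r.services.includes(alert.service) ||
      r.symptoms.some((s) => s.toLowerCase().includes(alert.symptom.toLowerCase()))
  );
}

// Hypothetical usage:
const archive: InvestigationRecord[] = [
  {
    id: "INV-001",
    date: "2024-01-12",
    services: ["payments-service"],
    symptoms: ["5xx spike"],
    queries: ["index=app_logs service=payments-service level=ERROR | stats count by event.type"],
    rootCause: "Downstream ledger timeout",
    resolution: "Raised client timeout, added retry with backoff",
  },
];

const related = findRelated(archive, { service: "payments-service", symptom: "timeout" });
```

Even this crude keyword matching pays off; a vector search over the same records is a natural next step, but the structured schema is what makes either approach work.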
---
## Call to Action
If your team deals with complex backend systems, on‑call rotations, and tribal knowledge that walks out the door when engineers leave, let’s talk.
- **What’s worked for you?**
- **What hasn’t?**
I’d love to hear how you’re approaching these challenges and explore how a similar human‑AI collaboration could help your organization.