# I Stopped Fighting My Logging Tools and Built an AI Co-Investigator
Source: Dev.to
## TL;DR
I restructured my team’s scattered documentation into an AI‑queryable format, modeled every service’s Splunk log events as TypeScript types, and built an investigation workflow around it. Complex incident investigations went from ~2 hours to ~30 minutes, and the system gets smarter with every investigation archived.
## Who this is for
Backend and platform engineers dealing with on‑call rotations, incident investigation across multiple services, and documentation that never stays current.
## The problem
It’s 2 AM. PagerDuty wakes you up. The alert says something is wrong with a service you haven’t touched in months.
You open:
- Splunk
- New Relic
- Your IDE
- Slack
…and then you open your team’s documentation (Confluence, a wiki, whatever your team uses) and the real challenge starts. It isn’t the tools’ fault; this is a people problem.
- The documentation is scattered across dozens of pages written by different engineers in different eras.
- Half of it is stale. The service was renamed six months ago and nobody updated the docs.
- You’re scanning through walls of text looking for one critical detail while a production incident ticks upward.
This was our reality. I decided to fix it—not by writing better docs, but by rethinking what documentation is for.
## Context
Our backend team ran a suite of Java Spring Boot services and Python Lambda functions on AWS.
- Multiple services, inconsistent logging, complex downstream dependencies spread across a large organisation.
- When incidents happened, engineers were expected to acknowledge within 5 minutes and be investigating within 20 minutes.
But “investigating” usually meant spending the first 20‑30 minutes just gathering context:
- Which service is this?
- Where are the logs?
- What does this log event mean?
- Has this happened before?
The information existed—it was just locked in Confluence pages, wiki entries, old Slack threads, and people’s heads. Not accessible at 2 AM, not under pressure.
## A new approach to documentation
| Traditional documentation | AI‑ready documentation |
|---|---|
| “How do I explain this system to a person?” | “How do I make this system queryable?” |
The answer: have both formats.
- `docs/` – comprehensive, narrative documentation for engineers to read and onboard with.
- `.context/` – dense, structured, AI‑queryable reference material.
### Step 1 – Consolidate everything

1. Pull all existing documentation into a single repository:
   - Confluence pages (converted from HTML to Markdown)
   - GitHub Pages docs
   - README files from key project repos
2. Ask the AI:

   > “What do industry best practices say about documenting projects that span multiple services, codebases, and teams? Let’s restructure what we have to align with those practices.”
The AI identified what was missing, stale, or mis‑categorised. We rebuilt the docs using those insights.
### Step 2 – Service manifests
Using Claude, I scanned each codebase and created service manifests that describe:
- What the service does
- What it talks to / what talks to it
Result:
- Mermaid diagrams of infrastructure topology
- Sequence diagrams of key request flows
- Inventories of cloud resources (DynamoDB tables, Lambda functions, S3 buckets) with associated repositories
The initial pass took about a week; thereafter I added services incrementally as investigations happened, so sprint velocity wasn’t impacted. This is a living effort, not a one‑and‑done project.
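To make the manifests queryable rather than purely narrative, each one can be expressed as structured data. The sketch below shows one plausible shape; the field names and the example service are illustrative, not the exact schema described in the article.

```typescript
// A minimal sketch of a service manifest entry.
// Field names and the example values are hypothetical.
interface ServiceManifest {
  name: string;            // service identifier
  description: string;     // what the service does
  upstream: string[];      // what talks to it
  downstream: string[];    // what it talks to
  resources: {             // associated cloud resources
    dynamoTables?: string[];
    lambdas?: string[];
    s3Buckets?: string[];
  };
  repository: string;      // where the code lives
}

// Hypothetical example entry
const paymentsManifest: ServiceManifest = {
  name: "payments-service",
  description: "Handles card authorisation and settlement",
  upstream: ["checkout-api"],
  downstream: ["ledger-service", "fraud-check-lambda"],
  resources: {
    dynamoTables: ["payments-transactions"],
    lambdas: ["fraud-check-lambda"],
  },
  repository: "github.com/example-org/payments-service",
};
```

A structured entry like this is what lets an AI answer “what talks to payments-service?” directly, and it is also the source the Mermaid topology diagrams can be generated from.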
### Step 3 – Normalise log formats
Each backend service logged to Splunk, but each had its own format.
A customerId might be a top‑level field in one service’s events but buried three levels deep in another.
This tribal knowledge lived only in the heads of experienced engineers.
Solution: generate TypeScript type definitions for each service’s log format.
```typescript
// Example: one service's log event structure
interface ServiceLogEntry {
  timestamp: string;
  level: string;
  service: string;
  event: {
    type: string;
    customerId: string;
    // …additional service-specific fields
  };
}
```

> *“We logged 14 investigations in the first month. By the end of that month, the system was already surfacing relevant past incidents and proven query patterns when new alerts came in.”*
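Once the event shapes are typed, even a tiny helper can paper over the differences between services. The sketch below assumes two simplified, hypothetical event shapes (one with `customerId` at the top level, one with it nested) rather than any real service’s format.

```typescript
// Sketch: normalise customerId extraction across services whose
// events place the field differently. Shapes are illustrative.
interface FlatEvent {
  customerId: string;
}

interface NestedEvent {
  event: { detail: { customer: { id: string } } };
}

type AnyLogEvent = FlatEvent | NestedEvent;

function extractCustomerId(e: AnyLogEvent): string {
  if ("customerId" in e) return e.customerId; // top-level field
  return e.event.detail.customer.id;          // buried three levels deep
}

// Usage:
const flat: FlatEvent = { customerId: "c-123" };
const nested: NestedEvent = { event: { detail: { customer: { id: "c-456" } } } };
```

The same structural knowledge is what the AI consumes: instead of remembering which service nests which field, it reads the type definitions.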
This is where human‑AI collaboration truly shines.
---
## Why Human‑AI Collaboration Matters
| Problem | Human‑Only Approach | AI‑Assisted Approach |
|---------|-------------------|----------------------|
| **Command knowledge** | Most engineers know only 5‑10 of the 140 Splunk commands. | AI knows every command (e.g., `stats`, `timechart`, `transaction`, `eventstats`) and can suggest the most appropriate one. |
| **Cognitive load** | Engineers juggle time windows, services, identifiers, and multiple tabs. | AI handles command selection, field renaming, and nested JSON navigation, letting engineers focus on reasoning. |
| **Query debugging** | Queries often fail; engineers spend minutes‑hours debugging while on calls. | AI writes, validates, and refines queries instantly. |
| **Pattern detection** | Scrolling through millions of events is slow and error‑prone. | AI scans sample results in seconds, spotting anomalies a human might miss. |
| **Knowledge gaps** | Few engineers are both Splunk experts *and* deep domain experts. | Engineers provide context; AI supplies meticulous command knowledge and data‑scanning capability. |
> *“You bring the context and domain knowledge. The AI brings encyclopedic command knowledge and the ability to scan vast structured data. It’s a genuinely great partnership.”*
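To make the division of labour concrete, here is a sketch of the kind of query the AI assembles from the typed log models. The index name, service name, and field path are hypothetical; the SPL pattern (`spath` to pull a nested field, `stats` to aggregate) is standard Splunk.

```typescript
// Sketch: build a Splunk query from the typed log model, so nobody
// has to hand-write nested field paths from memory.
// Index, service, and path values are hypothetical.
function buildErrorCountQuery(opts: {
  index: string;
  service: string;
  customerIdPath: string; // e.g. "event.customerId" or "event.detail.customer.id"
  window: string;         // e.g. "-1h"
}): string {
  return [
    `index=${opts.index} service=${opts.service} level=ERROR earliest=${opts.window}`,
    `spath output=customerId path=${opts.customerIdPath}`,
    `stats count by customerId`,
    `sort -count`,
  ].join(" | ");
}

const query = buildErrorCountQuery({
  index: "app_logs",
  service: "payments-service",
  customerIdPath: "event.detail.customer.id",
  window: "-1h",
});
```

The engineer supplies the context (which service, which time window, which customer); the machine supplies the command syntax and the correct path into the nested JSON.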
---
## Impact Metrics
| Metric | Before | After |
|--------|--------|-------|
| **Investigation time (complex issue)** | ~2 hours | ~30 minutes (≈ 75 % faster) |
| **Investigation time (familiar issue)** | ~45 minutes | ~10 minutes (≈ 78 % faster) |
| **Query sophistication** | Basic, inconsistent | Advanced, consistent patterns |
| **Investigation documentation** | Rarely created | 14 investigations archived in month 1 |
| **Knowledge retention** | Lost when engineers leave | Searchable investigation archive |
---
## Adoption Timeline
1. **Pilot** – Presented to ~20 engineers on the core team.
2. **Roll‑out** – Expanded to ~100 engineers, managers, and product owners across the organization.
3. **Cross‑team interest** – Multiple other teams have requested the same setup for their services.
---
## Frequently Asked Questions
### “Isn’t this just RAG?”
Most RAG implementations dump raw documents into a vector store and hope for the best.
Our value comes from **curated documentation**:
- Structured service manifests
- TypeScript log models
- Splunk reference material
- Archived investigations
*Garbage in, garbage out* applies to RAG just as much as anything else.
---
### “Won’t AI hallucinate bad Splunk queries?”
Because the AI has **structural knowledge of each service’s events**, the generated queries are usually more accurate than a human’s memory‑based attempts.
You still **validate** every query by running it and checking the results. Those validations feed back into the knowledge base, improving future accuracy.
---
### “My org won’t approve AI tools for production data.”
- **Documentation** about production data can be stored with a stricter security posture.
- Even without AI, the **restructured documentation** and **TypeScript log models** are valuable for onboarding, knowledge sharing, and maintaining accurate docs.
**Start with documentation restructuring**: version‑controlled, structured docs are far superior to scattered wiki pages. Once that foundation exists, adding an AI layer becomes much safer and more effective.
---
## Foundational Pieces
1. **Documentation Restructuring**
- Move from ad‑hoc wiki pages to version‑controlled, structured docs.
- When a service is renamed, AI can scan the entire doc set and update every reference—including misspellings, abbreviations, and contextual mentions that a simple find‑and‑replace would miss.
2. **TypeScript Log Models** *(highest leverage)*
- Model your log event structures in TypeScript.
- Captures massive tribal knowledge and provides a single source of truth for both humans and AI.
3. **Investigation Archival**
- The first investigation is the hardest. By the 14th, the system already suggests relevant past incidents and proven query patterns.
- Creates compound returns: each new archive improves the next investigation.
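The archive only compounds if each investigation is captured in a consistent, searchable shape. The record below is a sketch of one plausible schema, with a naive lookup that surfaces past incidents matching a new alert; all field names and example values are hypothetical.

```typescript
// Sketch of an archived-investigation record and a naive lookup
// that surfaces related past incidents. Fields are illustrative.
interface InvestigationRecord {
  id: string;
  date: string;
  services: string[];   // services involved
  symptoms: string[];   // e.g. "5xx spike", "DynamoDB throttling"
  queries: string[];    // Splunk queries that proved useful
  rootCause: string;
  resolution: string;
}

function findRelated(
  archive: InvestigationRecord[],
  alert: { service: string; symptom: string }
): InvestigationRecord[] {
  return archive.filter(
    (r) =>
      r.services.includes(alert.service) ||
      r.symptoms.some((s) => s.toLowerCase().includes(alert.symptom.toLowerCase()))
  );
}

// Hypothetical usage:
const archive: InvestigationRecord[] = [
  {
    id: "INV-001",
    date: "2024-01-12",
    services: ["payments-service"],
    symptoms: ["5xx spike"],
    queries: ["index=app_logs service=payments-service level=ERROR | stats count by event.type"],
    rootCause: "Downstream ledger timeout",
    resolution: "Raised client timeout, added retry with backoff",
  },
];

const related = findRelated(archive, { service: "payments-service", symptom: "timeout" });
```

Even this crude keyword matching pays off; a vector search over the same records is a natural next step, but the structured schema is what makes either approach work.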
---
## Call to Action
If your team deals with complex backend systems, on‑call rotations, and tribal knowledge that walks out the door when engineers leave, let’s talk.
- **What’s worked for you?**
- **What hasn’t?**
I’d love to hear how you’re approaching these challenges and explore how a similar human‑AI collaboration could help your organization.