I Stopped Fighting My Logging Tools and Built an AI Co-Investigator

Published: February 25, 2026 at 05:43 PM EST
7 min read
Source: Dev.to


TL;DR

I restructured my team’s scattered documentation into an AI‑queryable format, modeled every service’s Splunk log events as TypeScript types, and built an investigation workflow around it. Complex incident investigations went from ~2 hours to ~30 minutes, and the system gets smarter with every investigation archived.


Who this is for

Backend and platform engineers dealing with on‑call rotations, incident investigation across multiple services, and documentation that never stays current.


The problem

It’s 2 AM. PagerDuty wakes you up. The alert says something is wrong with a service you haven’t touched in months.

You open:

  • Splunk
  • New Relic
  • Your IDE
  • Slack

…and then you open your team’s documentation—Confluence, a wiki, whatever your team uses—and the real challenge starts. Not the tool’s fault. This is a people problem.

  • The documentation is scattered across dozens of pages written by different engineers in different eras.
  • Half of it is stale. The service was renamed six months ago and nobody updated the docs.
  • You’re scanning through walls of text looking for one critical detail while a production incident ticks upward.

This was our reality. I decided to fix it—not by writing better docs, but by rethinking what documentation is for.


Context

Our backend team ran a suite of Java Spring Boot services and Python Lambda functions on AWS.

  • Multiple services, inconsistent logging, complex downstream dependencies spread across a large organisation.
  • When incidents happened, engineers were expected to acknowledge within 5 minutes and be investigating within 20 minutes.

But “investigating” usually meant spending the first 20‑30 minutes just gathering context:

  • Which service is this?
  • Where are the logs?
  • What does this log event mean?
  • Has this happened before?

The information existed—it was just locked in Confluence pages, wiki entries, old Slack threads, and people’s heads. Not accessible at 2 AM, not under pressure.


A new approach to documentation

| Traditional documentation | AI-ready documentation |
|---------------------------|--------------------------|
| “How do I explain this system to a person?” | “How do I make this system queryable?” |

The answer: have both formats.

  • docs/ – comprehensive, narrative documentation for engineers to read and onboard with.
  • .context/ – dense, structured, AI‑queryable reference material.
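As an illustration, the split might look like this (a hypothetical layout; the actual directory and file names will vary by team):

```
repo/
├── docs/                    # narrative documentation for humans
│   ├── onboarding.md
│   └── architecture.md
└── .context/                # dense, AI-queryable reference material
    ├── service-manifests/   # what each service does and talks to
    ├── log-models/          # TypeScript log event types
    └── investigations/      # archived incident write-ups
```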

Step 1 – Consolidate everything

  1. Pull all existing documentation into a single repository:

    • Confluence pages (converted from HTML to Markdown)
    • GitHub Pages docs
    • README files from key project repos
  2. Ask the AI:

    “What do industry best practices say about documenting projects that span multiple services, codebases, and teams? Let’s restructure what we have to align with those practices.”

The AI identified what was missing, stale, or mis‑categorised. We rebuilt the docs using those insights.


Step 2 – Service manifests

Using Claude, I scanned each codebase and created service manifests that describe:

  • What the service does
  • What it talks to / what talks to it

Result:

  • Mermaid diagrams of infrastructure topology
  • Sequence diagrams of key request flows
  • Inventories of cloud resources (DynamoDB tables, Lambda functions, S3 buckets) with associated repositories
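A manifest for a single service might look something like this (hypothetical service names, fields, and values, shown as YAML for readability):

```yaml
service: order-service
runtime: java-spring-boot
description: Accepts orders and persists them to DynamoDB
depends_on:
  - payment-service
  - customer-api
consumed_by:
  - checkout-frontend
resources:
  dynamodb_tables: [orders]
  s3_buckets: [order-exports]
repository: github.com/example-org/order-service
```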

The initial pass took about a week; thereafter I added services incrementally as investigations happened, so sprint velocity wasn’t impacted. This is a living effort, not a one‑and‑done project.


Step 3 – Normalise log formats

Each backend service logged to Splunk, but each had its own format.
A customerId might be a top‑level field in one service’s events but buried three levels deep in another.

This mystical knowledge lived only in the heads of experienced engineers.

Solution: generate TypeScript type definitions for each service’s log format.

```typescript
// Example: one service's log event structure
interface ServiceLogEntry {
  timestamp: string;
  level: string;
  service: string;
  event: {
    type: string;
    customerId: string; // top-level in this service, nested in others
    // ...service-specific fields
  };
}
```

Step 4 – Archive every investigation

Each completed investigation was written up and stored alongside the other reference material, so the system could surface it the next time a similar alert fired:

> *“We logged 14 investigations in the first month. By the end of that month, the system was already surfacing relevant past incidents and proven query patterns when new alerts came in.”*

This is where human-AI collaboration truly shines.
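To make the value of the log models concrete, here is a minimal sketch (hypothetical event shapes, not our real schemas) of how explicit types let code, or an AI reading the types, extract the same field from two services that log it differently:

```typescript
// Two hypothetical services that log the same identifier in different shapes.
type ServiceAEvent = { service: string; customerId: string };
type ServiceBEvent = { service: string; payload: { customer: { id: string } } };

interface NormalizedEvent {
  service: string;
  customerId: string;
}

// The union type documents every known shape; the compiler forces us to
// handle each one when extracting the normalized fields.
function normalize(event: ServiceAEvent | ServiceBEvent): NormalizedEvent {
  if ("customerId" in event) {
    return { service: event.service, customerId: event.customerId };
  }
  return { service: event.service, customerId: event.payload.customer.id };
}
```

Once every known shape is captured in the union, a renamed or relocated field becomes a compile error instead of tribal knowledge.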

---

## Why Human-AI Collaboration Matters  

| Problem | Human-Only Approach | AI-Assisted Approach |
|---------|---------------------|----------------------|
| **Command knowledge** | Most engineers know only 5-10 of the 140 Splunk commands. | AI knows every command (e.g., `stats`, `timechart`, `transaction`, `eventstats`) and can suggest the most appropriate one. |
| **Cognitive load** | Engineers juggle time windows, services, identifiers, and multiple tabs. | AI handles command selection, field renaming, and nested JSON navigation, letting engineers focus on reasoning. |
| **Query debugging** | Queries often fail; engineers spend minutes to hours debugging while on calls. | AI writes, validates, and refines queries instantly. |
| **Pattern detection** | Scrolling through millions of events is slow and error-prone. | AI scans sample results in seconds, spotting anomalies a human might miss. |
| **Knowledge gaps** | Few engineers are both Splunk experts *and* deep domain experts. | Engineers provide context; AI supplies meticulous command knowledge and data-scanning capability. |

> *“You bring the context and domain knowledge. The AI brings encyclopedic command knowledge and the ability to scan vast structured data. It’s a genuinely great partnership.”*

---

## Impact Metrics  

| Metric | Before | After |
|--------|--------|-------|
| **Investigation time (complex issue)** | ~2 hours | ~30 minutes (≈ 75 % faster) |
| **Investigation time (familiar issue)** | ~45 minutes | ~10 minutes (≈ 78 % faster) |
| **Query sophistication** | Basic, inconsistent | Advanced, consistent patterns |
| **Investigation documentation** | Rarely created | 14 investigations archived in month 1 |
| **Knowledge retention** | Lost when engineers leave | Searchable investigation archive |

---

## Adoption Timeline  

1. **Pilot** – Presented to ~20 engineers on the core team.  
2. **Rollout** – Expanded to ~100 engineers, managers, and product owners across the organization.  
3. **Cross-team interest** – Multiple other teams have requested the same setup for their services.

---

## Frequently Asked Questions  

### “Isn’t this just RAG?”

Most RAG implementations dump raw documents into a vector store and hope for the best.  
Our value comes from **curated documentation**:  

- Structured service manifests  
- TypeScript log models  
- Splunk reference material  
- Archived investigations  

*Garbage in, garbage out* applies to RAG just as much as anything else.

---

### “Won’t AI hallucinate bad Splunk queries?”

Because the AI has **structural knowledge of each service’s events**, the generated queries are usually more accurate than a human’s memory-based attempts.  
You still **validate** every query by running it and checking the results. Those validations feed back into the knowledge base, improving future accuracy.

---

### “My org wont approve AI tools for production data.”  

- **Documentation** about production data can be stored with a stricter security posture.  
- Even without AI, the **restructured documentation** and **TypeScript log models** are valuable for onboarding, knowledge sharing, and maintaining accurate docs.  

**Start with documentation restructuring**: version-controlled, structured docs are far superior to scattered wiki pages. Once that foundation exists, adding an AI layer becomes much safer and more effective.

---

## Foundational Pieces  

1. **Documentation Restructuring**  
   - Move from ad-hoc wiki pages to version-controlled, structured docs.  
   - When a service is renamed, AI can scan the entire doc set and update every reference, including misspellings, abbreviations, and contextual mentions that a simple find-and-replace would miss.  

2. **TypeScript Log Models** *(highest leverage)*  
   - Model your log event structures in TypeScript.  
   - Captures massive tribal knowledge and provides a single source of truth for both humans and AI.  

3. **Investigation Archival**  
   - The first investigation is the hardest. By the 14th, the system already suggests relevant past incidents and proven query patterns.  
   - Creates compound returns: each new archive improves the next investigation.  
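The compounding effect can be sketched like this (a hypothetical record shape and naive tag-overlap scoring; the real system's retrieval is richer): rank archived investigations by how much they share with an incoming alert.

```typescript
interface ArchivedInvestigation {
  title: string;
  tags: string[];    // e.g. service names, error codes, symptoms
  queries: string[]; // Splunk queries that proved useful last time
}

// Rank archived investigations by how many tags they share with the alert,
// dropping anything with no overlap at all.
function suggestRelevant(
  archive: ArchivedInvestigation[],
  alertTags: string[],
): ArchivedInvestigation[] {
  const overlap = (inv: ArchivedInvestigation) =>
    inv.tags.filter((t) => alertTags.includes(t)).length;
  return archive
    .filter((inv) => overlap(inv) > 0)
    .sort((x, y) => overlap(y) - overlap(x));
}
```

Every archived write-up adds tags and proven queries to the pool, so each new investigation starts with better suggestions than the last.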

---

## Call to Action  

If your team deals with complex backend systems, on-call rotations, and tribal knowledge that walks out the door when engineers leave, let’s talk.  

- **What’s worked for you?**  
- **What hasn’t?**  

I’d love to hear how you’re approaching these challenges and explore how a similar human-AI collaboration could help your organization.