AI For Debugging Production Issues

Published: (June 13, 2026 at 11:46 PM EDT)
12 min read
Source: Dev.to

It’s 2:47am. The pager has just gone off for the third time in twenty minutes. Checkout latency is spiking. The error rate on /api/orders is climbing. Slack is filling with screenshots of half-finished trace views. Somewhere in your logs, the answer is sitting there in plain text, buried under a few million other lines that all look just as urgent.

This is the moment people are talking about when they say “AI is going to change how we debug production.” Not the demo where someone asks ChatGPT to write a regex. The 2:47am moment. The one where a tired human has to hold five tabs open in their head and form a hypothesis before the executive team starts asking for an ETA.

It turns out that’s where the technology has the most to offer, and also where it embarrasses itself most often. Let’s break down what’s actually working in 2026, where the seams still show, and how to wire an LLM into your incident-response loop so it earns its keep instead of just adding another window to glance at.

The two boring superpowers first: reading fast and correlating across heterogeneous signals. Those are the things humans do worst when they’re tired and time-pressured, and they’re the things a good LLM does at the same speed at 2am as at 2pm.

Datadog’s Bits AI SRE, which the company benchmarked against real incidents from hundreds of internal Datadog teams, is built around exactly this insight: an agent that can fan out across metrics, logs, traces, recent deploys, and incident history simultaneously, then collapse the findings into a single readable narrative. Datadog runs the agent against tens of thousands of evaluation scenarios and claims time-to-resolution wins of up to 95% in its published material. That headline number is marketing (you should always read it as “in the cases where the agent worked, this is what it shaved”), but the underlying capability is real, and it isn’t unique to Datadog. Honeycomb’s Query Assistant has been letting engineers ask trace questions in plain English since 2023. Open-source toolkits like OpenSRE plug an LLM into a long list of observability tools (Datadog, Honeycomb, CloudWatch, Sentry, Elasticsearch) so you can run the same idea on your own stack.

Here’s the part that’s easy to miss when you read the announcements: the AI isn’t doing your job. It’s doing the part of your job that’s the most boring and the most cognitively expensive at the same time: the “I have to hold this whole system in my head right now” part. That’s a real win even if it never proposes a single correct fix on its own.

The other thing worth saying out loud: AI is bad at the parts of debugging that look easy. It cannot tell you whether the incident is real. A model fed twenty thousand log lines will happily build a beautiful narrative of cascading failure even when the actual answer is “someone restarted the metrics agent and the dashboard panicked.” It has no skin in the game. If you ask it to find a root cause, it will find one. That is the entire game.

There is also the chain-of-thought trap, which the academic literature has been chewing on for a while. A 2025 paper on arXiv (“Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models”) showed that asking a model to reason out loud can simultaneously reduce the rate of hallucinated facts and make the remaining hallucinations much harder to detect, because the reasoning trail makes a fabricated conclusion look more credible. In practical terms: a confident, well-reasoned AI explanation of your outage is not evidence that the explanation is correct. It is evidence that the model is good at producing reasoning trails. Those are different things.

So treat the model’s output the same way you’d treat a junior engineer’s first guess during an incident: take it seriously, ask where it came from, and verify before you act.

Logs are the most obvious place to point an LLM. Most teams that start using AI for incident response start there: pipe a window of recent logs into the prompt and ask the model what it sees. This works surprisingly well for pattern surfacing: “there’s a spike of ECONNREFUSED to payments-internal starting at 02:39, followed two minutes later by a wave of 504s from the orders service.” A human can see that too, but the human has to scroll. The model spots it in one pass.

It is much worse at rare-but-meaningful lines. A single WARN: replica lag exceeded threshold buried among ten thousand routine INFO lines is the kind of thing a tired human notices because it looks weird and the model misses because it doesn’t fit the dominant pattern. The lesson, and a lot of teams have learned this the slow way, is that you should not give the LLM raw log streams as your only signal. Use structured logs, pre-filter for severity, surface anomalies via your normal observability tooling, and then ask the model to interpret the filtered set. Garbage in, confident-sounding garbage out.

There’s also a context-window economics issue. Even with the current generation of long-context models, dumping a million log lines into a prompt is expensive and slow, and the model’s accuracy degrades for content in the middle of the context window, the so-called “lost in the middle” problem that has been documented across multiple long-context benchmarks. The practical pattern is retrieval-augmented: vector-store your historical logs and recent incident transcripts, then pull only the slices that match the current signal. Pinecone, Weaviate, and Chroma are the obvious building blocks; pgvector is fine if you already run Postgres.

Traces are where the LLM-as-teammate framing actually clicks, because traces are exactly the kind of artefact humans hate to read manually. A distributed trace with 400 spans across 12 services is a structured object pretending to be readable text, and “structured object that looks like prose” is the model’s home turf.

Honeycomb’s Query Assistant is the canonical example. You type “why are checkout requests slower than yesterday for users in the EU?” and it builds you a real Honeycomb query against your actual data. Crucially, it doesn’t try to give you an answer; it gives you a query, which you can edit, run, and reason about. That’s a sane separation of concerns: the AI handles the translation from English to the platform’s query language, and the human keeps the judgment call.

You can build the same shape on top of any tracing system. The trick is to give the model the schema of your spans, not the spans themselves, in the system prompt: service names, attribute keys, common values. Then let it construct queries. If you skip this and just paste raw traces into a chat window, you’ll get plausible-sounding garbage about services that don’t exist in your stack.

Tip: if your spans don’t use consistent attribute names (http.route, db.statement, messaging.system, etc.), the AI will struggle no matter how good the model is. OpenTelemetry’s semantic conventions exist for a reason; if your team is mid-migration, adopting them is the single highest-leverage prep work for AI-assisted debugging.
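
The schema-in-the-prompt idea can be sketched roughly as follows. This is a minimal illustration, not any vendor’s implementation: the schema contents, the JSON query shape, and the helper names build_system_prompt and unknown_names are all assumptions made up for the example. The second helper checks the model’s proposed query against the schema, which is one cheap way to catch references to services or attributes that don’t exist in your stack before anything is run.

```python
import json

# Hypothetical schema summary for illustration only; in practice you would
# export service names, attribute keys, and common values from your own
# tracing backend.
SPAN_SCHEMA = {
    "services": ["checkout", "orders", "payments-internal"],
    "attributes": {
        "http.route": ["/api/orders"],
        "db.statement": [],
        "messaging.system": ["kafka"],
    },
}


def build_system_prompt(schema):
    """Describe the span schema (not the spans) and ask for a structured query back."""
    return (
        "You translate questions about distributed traces into queries.\n"
        "Use only the services and attribute keys listed in this schema:\n"
        f"{json.dumps(schema, indent=2, sort_keys=True)}\n"
        "Reply with a JSON object of the form "
        '{"service": "<name>", "filters": '
        '[{"attribute": "<key>", "op": "<op>", "value": "<value>"}]}. '
        "Do not answer the question yourself."
    )


def unknown_names(query, schema):
    """Return services or attribute keys in the model's query that the schema lacks."""
    problems = []
    if query.get("service") not in schema["services"]:
        problems.append(f"service:{query.get('service')}")
    for flt in query.get("filters", []):
        if flt.get("attribute") not in schema["attributes"]:
            problems.append(f"attribute:{flt.get('attribute')}")
    return problems
```

A query that comes back with a non-empty unknown_names result is a signal to re-prompt or hand the question to a human rather than to execute it.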

Pointing an LLM at an error message and asking it to explain is the lowest-effort, highest-payoff use of AI in incident response. New engineers in particular get more from it than from a dozen Stack Overflow tabs. “What does EAI_AGAIN mean? When does it usually fire in Node?” gets answered in seconds, with the correct mental model attached.

The danger is that errors are also where hallucinations look most believable. An invented Postgres error code, a non-existent NGINX flag, a confidently described environment variable that the runtime has never heard of: these come out of LLMs at unpredictable rates, and they’re the most expensive kind of wrong because they read like they could be right. The defensive habit: when the model tells you a flag exists or a config option behaves a certain way, you check the upstream docs before you reach for it. Always. Even at 3am. Especially at 3am.

This is also where you start to see the trade-off between leaning on AI and leaning on your team’s collective memory. A senior engineer who’s been on your stack for five years has a mental index of “errors that show up around full-disk events” and “errors that mean the load balancer is health-checking weirdly.” That index is local, weird, and irreplaceable. An LLM that’s never seen your stack only has the general version of that knowledge. The combination, feeding the LLM your last hundred postmortems via retrieval and letting it pattern-match against them, is what closes that gap.

The fun question isn’t “can the model tell me what’s wrong.” It’s “can the model give me three hypotheses ranked by plausibility, with the test I’d run to falsify each one?” That framing changes the prompt and it changes the answer. Instead of one confident-sounding root cause, you get a small portfolio of possibilities, each with a check. “Hypothesis 1: connection pool exhaustion in payments-svc. Test: query pg_stat_activity for active connections on the payments DB right now.” “Hypothesis 2: upstream rate limit on the Stripe webhook. Test: check the stripe_webhook_rejected_total metric over the last 30 minutes.” And so on.

Two things make this work in practice. First: you tell the model, in the system prompt, that you want hypotheses with falsification tests, ranked from cheapest-to-check to most expensive. Models are biased toward sounding confident, and an explicit instruction to enumerate alternatives counteracts that. Second: you keep the human as the one who picks which hypothesis to chase. The AI is a brainstorming partner, not a decision-maker. This is the same instinct that makes good incident commanders ask “what would change your mind about your current theory?” The LLM is just a fast, tireless source of devil’s-advocate hypotheses.

A technique worth borrowing here is self-consistency prompting (from Wang et al.’s 2022 paper “Self-Consistency Improves Chain of Thought Reasoning in Language Models”). The mechanism is simple: ask the model the same question several times and keep the answer that most of the samples agree on, discarding the outliers. Applied to incident response, you sample a handful of independent hypothesis sets and trust the ones that keep recurring. It’s a cheap way to filter out the model’s one-off confident guesses. It buys real reliability, and you can build it into your own pipeline in a weekend.

Here’s the unsexy claim that holds the whole thing together: AI is only as good at debugging as your runbooks are. The model doesn’t know your on-call escalation paths. It doesn’t know that your team’s convention is to drain the affected pod before SSHing in. It doesn’t know that the “restart the worker” command in your README is wrong and the real command lives in a Notion page from 2024.
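
The self-consistency filter described above can be sketched in a few lines. This is an adaptation under assumptions, not the procedure from the paper: it takes hypothesis sets that have already been sampled from the model and reduced to short comparable labels, and it keeps the ones that recur. The function name recurring_hypotheses and the min_votes threshold are hypothetical, and the model call itself is deliberately left out.

```python
from collections import Counter


def recurring_hypotheses(samples, min_votes=2):
    """Keep hypotheses that appear in at least min_votes of the sampled sets."""
    votes = Counter()
    for sample in samples:
        # Count each hypothesis once per sample so one verbose run
        # cannot outvote the others.
        votes.update({h.strip().lower() for h in sample})
    kept = [(h, n) for h, n in votes.items() if n >= min_votes]
    # Most frequently recurring first; ties broken alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Three independently sampled hypothesis sets (illustrative labels only).
samples = [
    ["Connection pool exhaustion in payments-svc",
     "Upstream rate limit on the Stripe webhook"],
    ["connection pool exhaustion in payments-svc",
     "DNS resolution failure"],
    ["upstream rate limit on the Stripe webhook",
     "connection pool exhaustion in payments-svc"],
]
survivors = recurring_hypotheses(samples)
```

Exact string matching only works if the prompt constrains the model to short, stable labels; free-text hypotheses would need some form of normalization or clustering first. The human still picks which surviving hypothesis to chase.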
If you want the LLM to operate as a teammate during an incident, you have to feed it the same context a new hire would get during their second-week shadow rotation. The pattern that works:

Runbooks live as structured markdown in a single index. Title, symptoms, decision tree, commands, escalation. The model retrieves the matching runbook by symptom and quotes the steps verbatim; it doesn’t paraphrase them, because paraphrased commands are how outages get worse.

Each runbook step has a “safe to run unattended” flag. Read-only diagnostics (kubectl get pods, pg_stat_activity queries) can be run by the agent. Mutating actions (kubectl rollout restart, deletes, scaling changes) require a human to approve. This is the boundary that keeps you from waking up to an AI that decided to “fix” production at 4am.

Every closed incident feeds back. The postmortem, the actual root cause, the timeline: they get embedded and pushed to the retrieval store. Six months later, the same symptom comes back, and the model can say “this looks like INC-2418 from January, here’s what was different about it.” That memory is what turns a tool into a teammate.

This is also where the marketing and the reality diverge. Vendors talk about “autonomous remediation”: the agent detects an issue and applies the fix without human approval. The technology is real for narrow cases (autoscaling rules, restarting a known-bad pod with a known-good config). The technology is not real for the long tail. Be conservative about which steps you let the agent execute. The cost of a wrong autonomous remediation is much higher than the cost of a slightly slower investigation.

The honest version of all this: AI doesn’t make incidents stop happening. It doesn’t replace the engineer who knows where the bodies are buried in your codebase. What it changes is the shape of the first ten minutes: the part where one tired human has to load the entire system into their head, scan four dashboards, and form a theory. A well-wired AI partner does that part in parallel with you. By the time you’ve finished your coffee and opened your tracing UI, you have three ranked hypotheses, the queries to verify each one, the matching runbook from the last time this symptom showed up, and a summary of what’s changed in the last 48 hours of deploys. You still do the thinking. You still make the call. But you start from minute ten instead of minute zero, and that compounds across a year of on-call rotations into a meaningfully less brutal job.

The teams that are getting this right in 2026 share a few habits: their logs are structured, their traces follow OpenTelemetry semantic conventions, their runbooks are written down and versioned, their postmortems get embedded for retrieval, and they treat the AI as an assistant that needs supervision rather than a senior engineer that’s always right. None of those habits are exotic. They’re just the same hygiene that makes any debugging tool more useful, AI included.

Originally published at nazarboyko.com.
