[Paper] Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications
Source: arXiv - 2512.22113v1
Overview
Cloud‑native applications run thousands of microservices, and when an incident strikes, pinpointing the exact piece of code or configuration that caused the outage can take hours—costing millions of dollars. The paper introduces PRAXIS, a novel orchestrator that lets a large language model (LLM) “walk” through two complementary graphs (service‑level and code‑level) to automatically perform root‑cause analysis (RCA). By turning the LLM into a graph‑traversal policy, PRAXIS dramatically speeds up and sharpens incident diagnosis.
Key Contributions
- Dual‑graph representation: Combines a Service Dependency Graph (SDG) for microservice interactions with a “hammock‑block” Program Dependence Graph (PDG) that captures fine‑grained code dependencies inside each service.
- LLM‑driven traversal policy: Uses prompting techniques (inspired by ReAct) to let the LLM decide which node to explore next, effectively turning the model into an autonomous agent that reasons over the graphs.
- PRAXIS orchestrator: A lightweight runtime that manages the LLM, graph queries, and external data (logs, traces) to produce a concise, human‑readable RCA report.
- Empirical gains: On a curated benchmark of 30 real‑world cloud incidents, PRAXIS achieves up to 3.1× higher RCA accuracy and 3.8× lower token usage compared with state‑of‑the‑art ReAct baselines.
- Open benchmark: The authors release the incident dataset as a new RCA benchmark for the research community.
Methodology
1. Graph Construction
- Service Dependency Graph (SDG): Nodes are microservices; directed edges represent RPC calls, message queues, or shared storage. The SDG is built from service‑mesh telemetry (e.g., OpenTelemetry) and deployment manifests.
- Hammock‑Block Program Dependence Graph (PDG): For each service, static analysis extracts control‑ and data‑flow dependencies between “hammocks” (clusters of statements) and individual code blocks, yielding a compact PDG that still preserves the causal structure needed for debugging.
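The two graph layers can be pictured with a minimal sketch; the service and hammock names below are illustrative assumptions, not taken from the paper's benchmark:

```python
# Service Dependency Graph (SDG): each service maps to the services
# it calls (RPC, queue, or shared storage), as recovered from telemetry.
sdg = {
    "frontend": ["cart", "checkout"],
    "checkout": ["payment", "cart"],
    "cart": [],
    "payment": [],
}

# Hammock-block PDG for a hypothetical "checkout" service: each node is
# a single-entry/single-exit statement cluster ("hammock"); edges are
# control- or data-flow dependences from static analysis.
pdg_checkout = {
    "parse_request": ["validate_cart"],
    "validate_cart": ["charge_card", "apply_coupon"],
    "charge_card": ["emit_receipt"],
    "apply_coupon": [],
    "emit_receipt": [],
}

print(sdg["frontend"])  # → ['cart', 'checkout']
```

The SDG stays coarse (one node per service) while each service carries its own compact PDG, so traversal can drill down only where needed.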
2. Agentic Traversal
- The LLM receives a structured prompt containing the incident description, the current graph node, and a short “action space” (e.g., move to dependent service X, inspect function Y in PDG).
- Using a ReAct‑style loop, the model outputs an action (which node to visit) and an observation (e.g., log snippet, error message).
- The orchestrator updates the context, fetches the next graph slice, and repeats until a termination condition (confidence threshold or max steps) is met.
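The loop above can be sketched as follows, with a deterministic stub standing in for the LLM policy; the graph, toy log snippets, and function names are all illustrative assumptions:

```python
# SDG and per-service log observations (hypothetical data).
sdg = {
    "frontend": ["cart", "checkout"],
    "checkout": ["payment"],
    "cart": [],
    "payment": [],
}
logs = {
    "frontend": "HTTP 500 on /buy",
    "cart": "healthy",
    "checkout": "timeout calling payment",
    "payment": "NullPointerException in charge()",
}

def llm_choose_action(node, neighbors):
    """Stub for the LLM policy: move to the first neighbor whose
    logs look suspicious, otherwise stop at the current node."""
    for n in neighbors:
        if logs[n] != "healthy":
            return ("move", n)
    return ("stop", node)

def traverse(start, max_steps=10):
    """ReAct-style loop: observe, act, repeat until stop/max steps."""
    node, trail = start, []
    for _ in range(max_steps):
        trail.append((node, logs[node]))        # record observation
        action, target = llm_choose_action(node, sdg[node])
        if action == "stop" or not sdg[node]:   # termination condition
            break
        node = target                           # orchestrator moves on
    return trail

trail = traverse("frontend")
print([n for n, _ in trail])  # → ['frontend', 'checkout', 'payment']
```

In PRAXIS the choice step is a structured LLM prompt over the current node and its action space rather than a keyword match, but the observe/act/update skeleton is the same.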
3. RCA Synthesis
- Once the traversal converges, the LLM compiles the visited nodes, observations, and inferred causal chain into a short, developer‑friendly root‑cause explanation, optionally linking to the offending code commit or configuration file.
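A simple templating sketch of the synthesis step, assuming a traversal trail like the one above (the trail contents are hypothetical; the paper's version is produced by the LLM, not a template):

```python
# Trail of (service, observation) pairs produced by the traversal.
trail = [
    ("frontend", "HTTP 500 on /buy"),
    ("checkout", "timeout calling payment"),
    ("payment", "NullPointerException in charge()"),
]

def synthesize_rca(trail):
    """Compile the visited chain into a short developer-facing report."""
    root_service, root_obs = trail[-1]
    chain = " -> ".join(node for node, _ in trail)
    return (
        f"Root cause: {root_obs} in service '{root_service}'.\n"
        f"Causal chain: {chain}"
    )

print(synthesize_rca(trail))
```

The real report would additionally cite the offending hammock in the PDG and, where available, the commit or configuration file that introduced it.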
Results & Findings
| Metric | PRAXIS | ReAct‑Baseline |
|---|---|---|
| RCA Accuracy (top‑1) | 78 % (↑3.1×) | 25 % |
| Average Tokens per Incident | 1.2 k (↓3.8×) | 4.6 k |
| Mean Traversal Steps | 7 | 22 |
| Time to Diagnosis | ~45 s (including API latency) | ~3 min |
- Higher precision stems from the PDG’s ability to prune irrelevant code paths early, while the SDG guides the LLM toward the most suspicious services.
- Token savings are achieved because the LLM only receives the minimal sub‑graph needed for each step, avoiding the “prompt bloat” that plagues monolithic ReAct approaches.
- The benchmark shows PRAXIS works across diverse incident types (null‑pointer crashes, mis‑configurations, version mismatches), indicating good generality.
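The "minimal sub-graph per step" idea behind the token savings can be sketched as a k-hop slice around the current node; the BFS helper and the graph below are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

# Hypothetical SDG; only the slice around the current node is
# serialized into the prompt, not the whole graph.
sdg = {
    "frontend": ["cart", "checkout"],
    "checkout": ["payment", "cart"],
    "cart": ["inventory"],
    "payment": ["bank-gateway"],
    "inventory": [],
    "bank-gateway": [],
}

def k_hop_slice(graph, start, k):
    """Return the set of nodes within k hops of `start` (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

print(sorted(k_hop_slice(sdg, "checkout", 1)))  # → ['cart', 'checkout', 'payment']
```

Prompt size then grows with the node's neighborhood rather than with the full service mesh, which is what keeps per-step token counts low.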
Practical Implications
- Faster MTTR (Mean Time to Repair): Integrating PRAXIS into SRE toolchains could shave minutes—or even hours—off incident resolution, directly translating to cost savings for high‑scale cloud providers.
- Developer‑friendly RCA reports: The generated explanations include exact file/line references, making hand‑off from SRE to dev teams seamless.
- Scalable automation: Because the orchestrator only pulls the necessary graph fragments, it can run on modest hardware and scale to thousands of services without overwhelming LLM token limits.
- Continuous improvement loop: Incident outcomes can be fed back to refine the PDG (e.g., adding dynamic call‑graph data) and the LLM prompts, creating a self‑learning RCA assistant.
- Potential for CI/CD integration: PRAXIS could be invoked automatically on failed deployments or post‑mortem pipelines, catching root causes before they surface in production.
Limitations & Future Work
- Static analysis reliance: The PDG is built from static code analysis, which may miss runtime‑generated code paths (e.g., reflection, plugins).
- Graph freshness: In rapidly evolving microservice ecosystems, keeping the SDG and PDG up‑to‑date requires frequent re‑generation, adding operational overhead.
- LLM hallucination risk: Although the structured traversal reduces hallucinations, the model can still produce plausible‑but‑incorrect causal links if the underlying telemetry is noisy or incomplete.
- Benchmark size: The evaluation uses 30 incidents; larger, more diverse datasets are needed to fully validate robustness.
- Future directions proposed by the authors include:
  - Augmenting PDGs with dynamic profiling data.
  - Exploring multi‑LLM ensembles for collaborative reasoning.
  - Extending PRAXIS to handle configuration‑only incidents where code traces are sparse.
Authors
- Shengkun Cui
- Rahul Krishna
- Saurabh Jha
- Ravishankar K. Iyer
Paper Information
- arXiv ID: 2512.22113v1
- Categories: cs.DC, cs.AI, cs.SE
- Published: December 26, 2025