[Paper] From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems
Source: arXiv - 2602.23701v1
Overview
The paper introduces CHIEF, a new framework that turns the tangled execution logs of large‑language‑model (LLM) powered multi‑agent systems into a hierarchical causal graph. By doing so, it makes it possible to pinpoint exactly why a multi‑agent workflow failed, rather than merely surfacing the symptom at the end of a flat log. This capability is crucial as developers increasingly rely on LLM‑driven agents for complex, coordinated tasks.
Key Contributions
- Hierarchical Causal Graph Construction – Converts flat, noisy logs into a multi‑level graph that captures intra‑agent actions, inter‑agent communications, and higher‑level task dependencies.
- Oracle‑Guided Backtracking – Uses a combination of real and synthesized “virtual” oracles to prune the search space efficiently while tracing failure propagation.
- Progressive Counterfactual Screening – Applies a systematic counterfactual analysis to separate true root causes from downstream side‑effects.
- Empirical Superiority – Demonstrates state‑of‑the‑art performance on the Who&When benchmark, beating eight strong baselines on both agent‑level and step‑level attribution accuracy.
- Ablation Insights – Shows that each module (graph construction, oracle‑guided pruning, counterfactual screening) contributes measurably to overall performance.
Methodology
- Log Ingestion & Parsing – Raw execution traces (messages, tool calls, state updates) are first tokenized and annotated with timestamps and agent identifiers.
- Hierarchical Graph Building
  - Leaf nodes represent atomic actions (e.g., a single LLM response).
  - Intermediate nodes group actions belonging to the same logical sub‑task or dialogue turn.
  - Root nodes capture the overall mission objective.
  - Edges encode causal dependencies (e.g., “Agent A’s request → Agent B’s response”).
- Oracle‑Guided Backtracking
  - A real oracle (the original LLM) can be queried to verify whether a hypothesized sub‑graph leads to success.
  - Virtual oracles are lightweight surrogate models trained on a small set of replayed episodes; they answer “what‑if” questions much faster.
  - The system backtracks from failure points, discarding branches that virtual oracles deem irrelevant, dramatically shrinking the search space.
- Counterfactual Attribution
  - For each candidate root cause, the framework performs a progressive intervention: it re‑executes the graph with that node altered (or removed) while everything else stays fixed.
  - If the failure disappears, the node is marked as a true cause; otherwise it is classified as a propagated symptom.
The whole pipeline runs automatically after a failure, requiring no manual annotation or expensive full‑system replays.
Results & Findings
| Metric | CHIEF | Best Baseline | Improvement |
|---|---|---|---|
| Agent‑level attribution accuracy | 84.2 % | 71.5 % | +12.7 pp |
| Step‑level attribution accuracy | 78.9 % | 65.3 % | +13.6 pp |
| Average oracle queries per attribution | 3.2 | 12.7 | −75 % |
- Higher precision in identifying the exact agent responsible for a failure, which is critical for debugging multi‑agent pipelines.
- Fewer oracle calls thanks to the virtual‑oracle pruning, making the approach practical for large‑scale deployments.
- Ablation studies reveal that removing the hierarchical graph drops accuracy by ~9 %, while skipping counterfactual screening adds ~6 % false positives.
Practical Implications
- Debug‑as‑a‑Service: Developers can integrate CHIEF into their CI/CD pipelines to automatically generate failure reports that include a causal graph and root‑cause explanations.
- Improved Reliability: By exposing hidden inter‑agent dependencies, teams can redesign coordination protocols before costly production incidents occur.
- Cost Savings: The virtual‑oracle layer reduces the need for full re‑execution of multi‑agent scenarios, cutting compute bills for large LLM deployments.
- Compliance & Auditing: A structured causal graph offers a clear audit trail, useful for regulatory environments where AI decision‑making must be explainable.
- Tooling Ecosystem: CHIEF’s graph format can be visualized with existing DAG tools (e.g., Graphviz, Mermaid), enabling quick visual inspection for non‑technical stakeholders.
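As a sketch of how such a graph could be handed to an existing DAG tool, the following turns a list of causal edges into Mermaid flowchart text; the edge format and node names are assumptions, since the paper does not specify an export schema:

```python
def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render causal edges as a Mermaid 'graph TD' (top-down) diagram."""
    lines = ["graph TD"]
    for cause, effect in edges:
        lines.append(f"    {cause} --> {effect}")
    return "\n".join(lines)

# Hypothetical failure chain: a request triggers a response that leads to failure.
diagram = to_mermaid([("a_request", "b_response"),
                      ("b_response", "task_failure")])
print(diagram)
```

Pasting the resulting text into any Mermaid renderer yields a top‑down causal diagram; a Graphviz DOT exporter would follow the same shape with `digraph { a -> b; }` syntax.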
Limitations & Future Work
- Domain Dependence: The current implementation is tuned for the Who&When benchmark; adapting to domains with richer tool‑use (e.g., code generation, robotics) may require custom node types.
- Oracle Quality: The accuracy of virtual oracles hinges on the quality and diversity of the replayed episodes; sparse failure data could degrade pruning effectiveness.
- Scalability of Counterfactuals: While progressive screening reduces the number of interventions, extremely large graphs (thousands of nodes) could still incur noticeable latency.
- Future Directions: The authors suggest extending CHIEF to online attribution (real‑time debugging), incorporating reinforcement learning to automatically suggest corrective actions, and exploring cross‑modal logs (e.g., visual sensor streams) for embodied multi‑agent systems.
Authors
- Yawen Wang
- Wenjie Wu
- Junjie Wang
- Qing Wang
Paper Information
- arXiv ID: 2602.23701v1
- Categories: cs.AI, cs.SE
- Published: February 27, 2026