[Paper] From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems
Source: arXiv - 2602.23701v1
Overview
The paper introduces CHIEF, a new framework that turns the tangled execution logs of large‑language‑model (LLM) powered multi‑agent systems into a hierarchical causal graph. By doing so, it makes it possible to pinpoint exactly why a multi‑agent workflow failed, rather than merely surfacing the symptom at the end of a flat log. This capability is crucial as developers increasingly rely on LLM‑driven agents for complex, coordinated tasks.
Key Contributions
- Hierarchical Causal Graph Construction – Converts flat, noisy logs into a multi‑level graph that captures intra‑agent actions, inter‑agent communications, and higher‑level task dependencies.
- Oracle‑Guided Backtracking – Uses a combination of real and synthesized “virtual” oracles to prune the search space efficiently while tracing failure propagation.
- Progressive Counterfactual Screening – Applies a systematic counterfactual analysis to separate true root causes from downstream side‑effects.
- Empirical Superiority – Demonstrates state‑of‑the‑art performance on the Who&When benchmark, beating eight strong baselines on both agent‑level and step‑level attribution accuracy.
- Ablation Insights – Shows that each module (graph construction, oracle‑guided pruning, counterfactual screening) contributes measurably to overall performance.
Methodology
- Log Ingestion & Parsing – Raw execution traces (messages, tool calls, state updates) are first tokenized and annotated with timestamps and agent identifiers.
- Hierarchical Graph Building
  - Leaf nodes represent atomic actions (e.g., a single LLM response).
  - Intermediate nodes group actions belonging to the same logical sub‑task or dialogue turn.
  - Root nodes capture the overall mission objective.
  - Edges encode causal dependencies (e.g., “Agent A’s request → Agent B’s response”).
- Oracle‑Guided Backtracking
  - A real oracle (the original LLM) can be queried to verify whether a hypothesized sub‑graph leads to success.
  - Virtual oracles are lightweight surrogate models trained on a small set of replayed episodes; they answer “what‑if” questions much faster.
  - The system backtracks from failure points, discarding branches that virtual oracles deem irrelevant, dramatically shrinking the search space.
- Counterfactual Attribution
  - For each candidate root cause, the framework performs a progressive intervention: it re‑executes the graph with that node altered (or removed) while everything else stays fixed.
  - If the failure disappears, the node is marked as a true cause; otherwise it is classified as a propagated symptom.
The whole pipeline runs automatically after a failure, requiring no manual annotation or expensive full‑system replays.
Results & Findings
| Metric | CHIEF | Best Baseline | Improvement |
|---|---|---|---|
| Agent‑level attribution accuracy | 84.2 % | 71.5 % | +12.7 pp |
| Step‑level attribution accuracy | 78.9 % | 65.3 % | +13.6 pp |
| Average oracle queries per attribution | 3.2 | 12.7 | −75 % |
- Higher precision in identifying the exact agent responsible for a failure, which is critical for debugging multi‑agent pipelines.
- Fewer oracle calls thanks to the virtual‑oracle pruning, making the approach practical for large‑scale deployments.
- Ablation studies reveal that removing the hierarchical graph drops accuracy by ~9 %, while skipping counterfactual screening adds ~6 % false positives.
Practical Implications
- Debug‑as‑a‑Service: Developers can integrate CHIEF into their CI/CD pipelines to automatically generate failure reports that include a causal graph and root‑cause explanations.
- Improved Reliability: By exposing hidden inter‑agent dependencies, teams can redesign coordination protocols before costly production incidents occur.
- Cost Savings: The virtual‑oracle layer reduces the need for full re‑execution of multi‑agent scenarios, cutting compute bills for large LLM deployments.
- Compliance & Auditing: A structured causal graph offers a clear audit trail, useful for regulatory environments where AI decision‑making must be explainable.
- Tooling Ecosystem: CHIEF’s graph format can be visualized with existing DAG tools (e.g., Graphviz, Mermaid), enabling quick visual inspection for non‑technical stakeholders.
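As a sketch of how such a graph could be handed to an existing DAG tool, the following turns a list of causal edges into Mermaid flowchart text; the edge format and node names are assumptions, since the paper does not specify an export schema:

```python
def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render causal edges as a Mermaid 'graph TD' (top-down) diagram."""
    lines = ["graph TD"]
    for cause, effect in edges:
        lines.append(f"    {cause} --> {effect}")
    return "\n".join(lines)

# Hypothetical failure chain: a request triggers a response that leads to failure.
diagram = to_mermaid([("a_request", "b_response"),
                      ("b_response", "task_failure")])
print(diagram)
```

Pasting the resulting text into any Mermaid renderer yields a top‑down causal diagram; a Graphviz DOT exporter would follow the same shape with `digraph { a -> b; }` syntax.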
Limitations & Future Work
- Domain Dependence: The current implementation is tuned for the Who&When benchmark; adapting to domains with richer tool‑use (e.g., code generation, robotics) may require custom node types.
- Oracle Quality: The accuracy of virtual oracles hinges on the quality and diversity of the replayed episodes; sparse failure data could degrade pruning effectiveness.
- Scalability of Counterfactuals: While progressive screening reduces the number of interventions, extremely large graphs (thousands of nodes) could still incur noticeable latency.
- Future Directions: The authors suggest extending CHIEF to online attribution (real‑time debugging), incorporating reinforcement learning to automatically suggest corrective actions, and exploring cross‑modal logs (e.g., visual sensor streams) for embodied multi‑agent systems.
Authors
- Yawen Wang
- Wenjie Wu
- Junjie Wang
- Qing Wang
Paper Information
- arXiv ID: 2602.23701v1
- Categories: cs.AI, cs.SE
- Published: February 27, 2026