[Paper] From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems

Published: February 27, 2026 at 01:08 AM EST
4 min read
Source: arXiv


Overview

The paper introduces CHIEF, a new framework that turns the tangled execution logs of large‑language‑model (LLM) powered multi‑agent systems into a hierarchical causal graph. By doing so, it makes it possible to pinpoint exactly why a multi‑agent workflow failed, rather than merely surfacing the symptom at the end of a flat log. This capability is crucial as developers increasingly rely on LLM‑driven agents for complex, coordinated tasks.

Key Contributions

  • Hierarchical Causal Graph Construction – Converts flat, noisy logs into a multi‑level graph that captures intra‑agent actions, inter‑agent communications, and higher‑level task dependencies.
  • Oracle‑Guided Backtracking – Uses a combination of real and synthesized “virtual” oracles to prune the search space efficiently while tracing failure propagation.
  • Progressive Counterfactual Screening – Applies a systematic counterfactual analysis to separate true root causes from downstream side‑effects.
  • Empirical Superiority – Demonstrates state‑of‑the‑art performance on the Who&When benchmark, beating eight strong baselines on both agent‑level and step‑level attribution accuracy.
  • Ablation Insights – Shows that each module (graph construction, oracle‑guided pruning, counterfactual screening) contributes measurably to overall performance.

Methodology

  1. Log Ingestion & Parsing – Raw execution traces (messages, tool calls, state updates) are first tokenized and annotated with timestamps and agent identifiers.
  2. Hierarchical Graph Building
    • Leaf Nodes represent atomic actions (e.g., a single LLM response).
    • Intermediate Nodes group actions belonging to the same logical sub‑task or dialogue turn.
    • Root Nodes capture the overall mission objective.
      Edges encode causal dependencies (e.g., “Agent A’s request → Agent B’s response”).
  3. Oracle‑Guided Backtracking
    • A real oracle (the original LLM) can be queried to verify whether a hypothesized sub‑graph leads to success.
    • Virtual oracles are lightweight surrogate models trained on a small set of replayed episodes; they answer “what‑if” questions much faster.
    • The system backtracks from failure points, discarding branches that virtual oracles deem irrelevant, dramatically shrinking the search space.
  4. Counterfactual Attribution
    • For each candidate root cause, the framework performs a progressive intervention: it re‑executes the graph with that node altered (or removed) while keeping everything else fixed.
    • If the failure disappears, the node is marked as a true cause; otherwise, it is classified as a propagated symptom.

The whole pipeline runs automatically after a failure, requiring no manual annotation or expensive full‑system replays.
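Steps 3 and 4 can be sketched as a single screening loop: a virtual oracle prunes irrelevant candidates, then each survivor is intervened on (here, simply removed) and the run replayed. The `virtual_oracle` and `replay_without` callables are stand‑ins for the paper's components, not their real interfaces.

```python
from typing import Callable, Iterable

def attribute_root_causes(
    candidates: Iterable[str],
    virtual_oracle: Callable[[str], bool],   # True if the node might matter to the failure
    replay_without: Callable[[str], bool],   # True if the run still fails with the node removed
) -> tuple[list[str], list[str]]:
    """Return (true_causes, propagated_symptoms) under this sketch's assumptions."""
    causes, symptoms = [], []
    for node_id in candidates:
        if not virtual_oracle(node_id):      # oracle-guided prune: skip irrelevant branches
            continue
        if replay_without(node_id):          # failure persists -> node was only a symptom
            symptoms.append(node_id)
        else:                                # failure disappears -> node is a true root cause
            causes.append(node_id)
    return causes, symptoms

# Toy scenario: removing "a1" fixes the run; "b1" only propagates the error.
causes, symptoms = attribute_root_causes(
    candidates=["a1", "b1", "c1"],
    virtual_oracle=lambda n: n != "c1",      # oracle deems c1 irrelevant
    replay_without=lambda n: n != "a1",      # run succeeds only when a1 is removed
)
print(causes, symptoms)  # → ['a1'] ['b1']
```

The pruning step is what keeps the intervention count low: only nodes the cheap virtual oracle flags as plausible ever trigger a costly replay.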

Results & Findings

| Metric | CHIEF | Best Baseline | Gain |
| --- | --- | --- | --- |
| Agent‑level attribution accuracy | 84.2 % | 71.5 % | +12.7 pts |
| Step‑level attribution accuracy | 78.9 % | 65.3 % | +13.6 pts |
| Average oracle queries per attribution | 3.2 | 12.7 | −75 % |
  • Higher precision in identifying the exact agent responsible for a failure, which is critical for debugging multi‑agent pipelines.
  • Fewer oracle calls thanks to the virtual‑oracle pruning, making the approach practical for large‑scale deployments.
  • Ablation studies reveal that removing the hierarchical graph drops accuracy by ~9 %, while skipping counterfactual screening adds ~6 % false positives.
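
A quick sanity check on the table's numbers: the accuracy gains are absolute percentage‑point differences, while the oracle‑query figure is a relative reduction.

```python
agent_gain = 84.2 - 71.5           # percentage points, agent-level attribution
step_gain = 78.9 - 65.3            # percentage points, step-level attribution
query_reduction = 1 - 3.2 / 12.7   # relative cut in oracle calls per attribution

print(round(agent_gain, 1), round(step_gain, 1), round(query_reduction * 100))
# → 12.7 13.6 75
```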

Practical Implications

  • Debug‑as‑a‑Service: Developers can integrate CHIEF into their CI/CD pipelines to automatically generate failure reports that include a causal graph and root‑cause explanations.
  • Improved Reliability: By exposing hidden inter‑agent dependencies, teams can redesign coordination protocols before costly production incidents occur.
  • Cost Savings: The virtual‑oracle layer reduces the need for full re‑execution of multi‑agent scenarios, cutting compute bills for large LLM deployments.
  • Compliance & Auditing: A structured causal graph offers a clear audit trail, useful for regulatory environments where AI decision‑making must be explainable.
  • Tooling Ecosystem: CHIEF’s graph format can be visualized with existing DAG tools (e.g., Graphviz, Mermaid), enabling quick visual inspection for non‑technical stakeholders.
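
The paper does not specify CHIEF's export format, but rendering the causal edge list as a Mermaid flowchart is straightforward; this helper and its node names are hypothetical.

```python
def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render (cause, effect) pairs as a Mermaid flowchart for quick visual inspection."""
    lines = ["flowchart TD"]
    for cause, effect in edges:
        lines.append(f"    {cause} --> {effect}")
    return "\n".join(lines)

print(to_mermaid([
    ("AgentA_request", "AgentB_response"),
    ("AgentB_response", "task_failure"),
]))
```

Pasting the output into any Mermaid‑aware viewer (or converting edges to DOT for Graphviz) gives a top‑down picture of how the failure propagated.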

Limitations & Future Work

  • Domain Dependence: The current implementation is tuned for the Who&When benchmark; adapting to domains with richer tool‑use (e.g., code generation, robotics) may require custom node types.
  • Oracle Quality: The accuracy of virtual oracles hinges on the quality and diversity of the replayed episodes; sparse failure data could degrade pruning effectiveness.
  • Scalability of Counterfactuals: While progressive screening reduces the number of interventions, extremely large graphs (thousands of nodes) could still incur noticeable latency.
  • Future Directions: The authors suggest extending CHIEF to online attribution (real‑time debugging), incorporating reinforcement‑learning to automatically suggest corrective actions, and exploring cross‑modal logs (e.g., visual sensor streams) for embodied multi‑agent systems.

Authors

  • Yawen Wang
  • Wenjie Wu
  • Junjie Wang
  • Qing Wang

Paper Information

  • arXiv ID: 2602.23701v1
  • Categories: cs.AI, cs.SE
  • Published: February 27, 2026