[Paper] Explaining the Reasoning of Large Language Models Using Attribution Graphs
Source: arXiv - 2512.15663v1
Overview
Large language models (LLMs) can generate impressively coherent text, but the chain of reasoning that leads to each token is hidden from users. Walker and Ewetz propose Context Attribution via Graph Explanations (CAGE), a method that builds a directed attribution graph to trace how every generated token is influenced not only by the original prompt but also by the tokens generated before it. By preserving causality and ensuring that each row of the graph sums to one, CAGE yields markedly more faithful explanations of LLM reasoning than prior context‑attribution techniques.
Key Contributions
- Attribution Graph Formalism – Introduces a directed, row‑stochastic graph that captures token‑to‑token influence across the entire generation sequence.
- CAGE Framework – Provides a systematic way to compute context attributions by marginalizing over all paths in the graph, preserving causal relationships.
- Faithfulness Boost – Empirically demonstrates up to 40 % improvement in attribution faithfulness across several LLMs (e.g., GPT‑2, LLaMA) and benchmark datasets.
- Generalizable Pipeline – Works with multiple attribution methods (e.g., Integrated Gradients, Gradient × Input) and can be plugged into existing model‑inspection toolkits.
- Open‑source Implementation – Authors release code and pre‑computed graphs, enabling reproducibility and rapid adoption by the community.
Methodology
- Token‑Level Influence Scores – For each generation step, the authors compute a raw attribution vector that distributes credit among all tokens that could have contributed (prompt + previously generated tokens).
- Graph Construction – These vectors become the rows of a directed graph G: the edge from an earlier token i to a generated token j carries the normalized influence weight, so each row sums to 1 (row stochasticity) and edges only point forward in time (causality); a construction sketch follows this list.
- Marginalization Over Paths – To obtain the overall contribution of the original prompt to a later token, CAGE sums the products of edge weights along every possible path from prompt tokens to the target token, analogous to computing total flow through a network; a dynamic‑programming sketch of this step appears below.
- Evaluation Protocol – Faithfulness is measured by perturbation tests (removing high‑attribution tokens and observing the change in the output) and by comparison against ground‑truth reasoning traces where available (a minimal sketch of such a test follows the results table).
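The construction described in the first two bullets can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released implementation: the per‑step scores are assumed to come from whatever attribution method is in use (e.g., gradient × input magnitudes), and the helper name `build_attribution_graph` and the non‑negativity clipping are assumptions made here for concreteness.

```python
import numpy as np

def build_attribution_graph(step_scores, num_prompt_tokens):
    """Assemble a row-stochastic, causal attribution matrix G.

    step_scores[t] holds raw attribution scores for the t-th *generated*
    token over every token that precedes it (all prompt tokens plus
    generated tokens 0..t-1). Row t of G is that vector normalized to
    sum to 1; entries for the current and later tokens stay 0, so every
    edge points forward in time.
    """
    num_generated = len(step_scores)
    total = num_prompt_tokens + num_generated
    G = np.zeros((total, total))
    for t, scores in enumerate(step_scores):
        row = num_prompt_tokens + t            # index of the t-th generated token
        scores = np.clip(np.asarray(scores, dtype=float), 0.0, None)
        assert scores.shape[0] == row, "expect one score per preceding token"
        z = scores.sum()
        if z > 0:
            G[row, :row] = scores / z          # row sums to 1 (row stochasticity)
    return G

# Toy usage: a 3-token prompt followed by 2 generated tokens.
step_scores = [
    [0.2, 0.5, 0.3],        # generated token 0: 3 predecessors
    [0.1, 0.1, 0.2, 0.6],   # generated token 1: 4 predecessors
]
G = build_attribution_graph(step_scores, num_prompt_tokens=3)
print(G.round(2))
```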
The approach is deliberately model‑agnostic: it treats the LLM as a black box that can provide token‑level gradients or other attribution signals, then builds the graph on top of those signals.
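Because the graph is strictly causal (lower triangular), the sum over all paths terminates and can be computed in a single forward dynamic‑programming pass. The sketch below marginalizes credit back onto the prompt under that assumption; the function name and toy numbers are illustrative, not the paper's code.

```python
import numpy as np

def marginalize_paths(G, num_prompt_tokens):
    """Sum edge-weight products over every forward path from prompt
    tokens to each generated token.

    Because G is strictly lower triangular (causal), the path series
    G + G^2 + G^3 + ... terminates, so the rows can be filled in a
    single forward pass: R[t] = G[t] + sum_k G[t, k] * R[k].
    """
    total = G.shape[0]
    R = np.zeros_like(G)
    for t in range(total):
        R[t, :t] = G[t, :t]
        for k in range(num_prompt_tokens, t):   # only generated tokens relay credit
            if G[t, k] > 0:
                R[t, :k] += G[t, k] * R[k, :k]
    # Rows = generated tokens, columns = original prompt tokens.
    return R[num_prompt_tokens:, :num_prompt_tokens]

# Toy usage with the 5-token graph from the construction sketch above.
G = np.zeros((5, 5))
G[3, :3] = [0.2, 0.5, 0.3]
G[4, :4] = [0.1, 0.1, 0.2, 0.6]
contrib = marginalize_paths(G, num_prompt_tokens=3)
print(contrib.round(2))   # each row is a distribution of credit over the 3 prompt tokens
```

Note that when every generated row of G sums to one, each row of the returned matrix also sums to one, so the marginalized result reads as a distribution of credit over the prompt tokens.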
Results & Findings
| Model / Dataset | Baseline Faithfulness Score (no graph) | CAGE Improvement |
|---|---|---|
| GPT‑2 on WikiText‑103 | 0.62 | +28 % |
| LLaMA‑7B on GSM‑8K | 0.55 | +34 % |
| Falcon‑40B on TruthfulQA | 0.48 | +40 % |
- Higher Correlation with Human Judgments – When users rated explanations for clarity, CAGE‑derived attributions were consistently preferred.
- Robust Across Attribution Methods – Whether using Integrated Gradients, DeepLIFT, or simple gradient × input, the graph marginalization step added a similar boost.
- Scalable – Graph construction is linear in the number of generated tokens; marginalization can be performed efficiently with dynamic programming, keeping overhead under 15 % of total inference time.
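The faithfulness scores above come from perturbation tests of the kind outlined in the Methodology section. The paper's exact metric is not reproduced in this summary, so the sketch below is only a generic version of such a test; `target_logprob` is a hypothetical callable supplied by the caller that re‑scores the target token after masking out part of the context (how the masking is realized depends on the model being explained).

```python
import numpy as np

def perturbation_faithfulness(attributions, target_logprob, top_fraction=0.2):
    """Compare the effect of deleting the most-attributed context tokens
    against deleting an equal number of random tokens.

    attributions   -- one score per context token for the target step.
    target_logprob -- hypothetical callable: given a boolean keep-mask over
                      the context, returns the target token's log-probability.
    A faithful explanation should make the first drop much larger than the second.
    """
    attributions = np.asarray(attributions, dtype=float)
    n = attributions.shape[0]
    k = max(1, int(round(top_fraction * n)))
    keep_all = np.ones(n, dtype=bool)
    base = target_logprob(keep_all)

    top_mask = keep_all.copy()
    top_mask[np.argsort(attributions)[-k:]] = False      # drop the top-k tokens
    drop_top = base - target_logprob(top_mask)

    rand_mask = keep_all.copy()
    rand_mask[np.random.default_rng(0).choice(n, k, replace=False)] = False
    drop_random = base - target_logprob(rand_mask)

    return drop_top, drop_random

# Dummy usage with a stand-in scoring function (a real test would query the LLM).
scores = [0.05, 0.6, 0.1, 0.25]
fake_logprob = lambda mask: float(-1.0 - (~mask * np.array(scores)).sum())
print(perturbation_faithfulness(scores, fake_logprob, top_fraction=0.5))
```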
Practical Implications
- Debugging LLM‑Powered Applications – Developers can pinpoint which part of a prompt (or which earlier generated token) is driving an unexpected answer, making it easier to refine prompts or add guardrails.
- Safety & Compliance – Attribution graphs provide audit trails that regulators could demand for high‑risk domains (e.g., medical advice, financial recommendations).
- Prompt Engineering Tools – Integrated into IDE plugins, CAGE can visualise influence flows in real time, helping engineers craft more reliable prompts.
- Model Distillation & Compression – By revealing the most influential context windows, CAGE can guide selective pruning or knowledge distillation without sacrificing reasoning fidelity.
- Explainable AI Interfaces – End‑user products (chatbots, code assistants) can surface “why this answer?” visualisations that are grounded in a mathematically sound attribution graph rather than a simplistic token‑to‑prompt heatmap (a minimal query sketch follows this list).
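A “why this answer?” view can be read directly off the marginalized prompt‑contribution matrix produced by the earlier sketches. The function name, tokens, and numbers below are hypothetical and purely illustrative.

```python
import numpy as np

def explain_token(contrib, prompt_tokens, generated_index, k=3):
    """Return the top-k prompt tokens (and their shares of credit) that the
    marginalized attributions assign to one generated token."""
    row = np.asarray(contrib)[generated_index]
    top = np.argsort(row)[::-1][:k]
    return [(prompt_tokens[i], float(row[i])) for i in top]

prompt_tokens = ["The", "capital", "of", "France", "is"]
# A toy contribution matrix: one generated token over 5 prompt tokens.
contrib = np.array([[0.05, 0.15, 0.05, 0.55, 0.20]])
print(explain_token(contrib, prompt_tokens, generated_index=0))
# -> [('France', 0.55), ('is', 0.2), ('capital', 0.15)]
```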
Limitations & Future Work
- Assumption of Linear Influence – The current graph aggregates additive attributions; non‑linear interactions between tokens may be under‑represented.
- Dependence on Underlying Attribution Quality – If the base gradient‑based method is noisy, the graph inherits that noise.
- Scalability to Very Long Contexts – While linear, the memory footprint grows with sequence length; future work could explore sparse or hierarchical graph representations.
- User Studies Needed – The paper’s human‑evaluation is limited; broader usability studies would confirm the practical value of the visualisations.
The authors suggest extending CAGE to multimodal models, incorporating attention‑head information, and exploring causal intervention experiments to further tighten the link between attribution graphs and actual model reasoning.
Authors
- Chase Walker
- Rickard Ewetz
Paper Information
- arXiv ID: 2512.15663v1
- Categories: cs.AI, cs.CL
- Published: December 17, 2025