[Paper] Causality is Key for Interpretability Claims to Generalise
Source: arXiv - 2602.16698v1
Overview
Interpretability work on large language models (LLMs) has produced many fascinating insights, but the field still wrestles with two recurring problems: results that don’t hold up when the model or data change, and causal explanations that go beyond what the evidence actually supports. This paper argues that a rigorous causal‑inference framework—specifically Pearl’s causal hierarchy—provides the missing scaffolding for turning activation‑level observations into generalisable interpretability claims.
Key Contributions
- Causal framing of interpretability – Shows how Pearl’s three‑level causal hierarchy (association → intervention → counterfactual) maps onto common interpretability techniques.
- Clarification of what each method can legitimately claim – Distinguishes between associative findings (e.g., “neuron X fires when the model mentions dates”), interventional evidence (e.g., “ablating neuron X reduces the probability of date tokens”), and counterfactual statements (e.g., “if neuron X had been active, the model would have generated a different answer”).
- Operationalisation via Causal Representation Learning (CRL) – Demonstrates how CRL can be used to recover latent causal variables from hidden activations, together with the assumptions required for each recovery.
- Diagnostic framework for practitioners – Provides a checklist that aligns research questions, chosen methods, and evaluation metrics with the appropriate causal level, helping avoid over‑reaching claims.
- Empirical illustration on LLMs – Applies the framework to a set of standard interpretability probes (activation patching, neuron ablation, probing classifiers) and shows where the evidence stops short of supporting counterfactual conclusions.
Methodology
- Mapping interpretability tools onto Pearl’s hierarchy
- Associations: Correlational analyses such as probing classifiers or activation‑ranking.
- Interventions: Controlled edits to the model (ablation, activation patching, weight surgery) and measuring the resulting change in a behavioural metric (e.g., token‑probability shift).
- Counterfactuals: Hypothetical “what‑if” scenarios that require knowledge of the underlying structural causal model (SCM) – typically unavailable for LLMs.
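The first two levels can be made concrete with a toy numerical sketch. This is a stand‑in linear model, not the paper’s experimental setup: association measures how a hidden unit co‑varies with behaviour, while an intervention sets the unit (here, do(h = 0)) and measures the resulting behavioural change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a hidden unit h drives an output logit. (Illustrative
# stand-in for an LLM; not the paper's experimental setup.)
w_in = rng.normal(size=(4, 1))
w_out = rng.normal(size=(1,))

def forward(x, ablate_h=False):
    h = np.tanh(x @ w_in)          # hidden activation
    if ablate_h:
        h = np.zeros_like(h)       # intervention: do(h = 0)
    return float(h @ w_out)        # output logit

X = rng.normal(size=(1000, 4))

# Level 1 (association): does h co-vary with the output?
h = np.tanh(X @ w_in).ravel()
y = np.array([forward(x) for x in X])
corr = np.corrcoef(h, y)[0, 1]

# Level 2 (intervention): ablate h and measure the behavioural change.
y_do = np.array([forward(x, ablate_h=True) for x in X])
effect = np.mean(np.abs(y - y_do))

print(f"association (corr): {corr:.2f}, interventional effect: {effect:.3f}")
```

In this toy case the association happens to be causal, which is exactly what cannot be assumed for a real LLM: only the interventional measurement licenses the level‑2 claim.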
- Causal Representation Learning (CRL) pipeline
- Define a set of latent variables that are hypothesised to capture high‑level concepts (e.g., “sentiment”, “syntax”).
- Train an encoder that maps hidden activations to these latents while enforcing identifiability constraints (e.g., independent mechanisms, non‑Gaussian noise).
- Validate the learned latents by performing interventional experiments (e.g., intervene on a latent and observe downstream token changes).
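The three pipeline steps can be sketched end to end on synthetic data. Everything below is an assumption for illustration (two named concept latents, a linear mixing into activations, and least‑squares fitting as a stand‑in for a CRL objective with interventional supervision); it is not the paper’s method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: activations are a linear mix of two ground-truth
# concept latents (say, "sentiment" and "syntax"). Names and shapes are
# illustrative only.
n, d_act, d_lat = 2000, 8, 2
z_true = rng.normal(size=(n, d_lat))        # latent concepts
mix = rng.normal(size=(d_lat, d_act))       # mixing into activations
acts = z_true @ mix

# Steps 1-2: fit an encoder. With interventional supervision available,
# least squares serves as a stand-in for a CRL training objective.
enc, *_ = np.linalg.lstsq(acts, z_true, rcond=None)
z_hat = acts @ enc

# Identifiability check: recovered latents should align with the
# intended concepts (cross-correlation diagonal near 1).
align = np.corrcoef(z_hat.T, z_true.T)[:d_lat, d_lat:]

# Step 3: intervene on latent 0, decode back, and observe the induced
# change in activation space.
dec = np.linalg.pinv(enc)
z_int = z_hat.copy()
z_int[:, 0] = 0.0                           # do(z_0 = 0)
delta = (z_int - z_hat) @ dec               # induced activation change

print("alignment diag:", np.round(np.diag(align), 3))
print("mean |activation shift|:", float(np.abs(delta).mean()))
```

The recovery succeeds here precisely because the synthetic data satisfies the identifiability assumptions; the paper’s point is that such assumptions must be stated and checked, not presumed.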
- Diagnostic checklist
- For a given claim, the checklist asks:
a) Which causal level does the claim belong to?
b) Does the chosen method provide evidence at that level?
c) What additional data or assumptions (e.g., access to ground‑truth interventions) are needed to move up the hierarchy?
The authors illustrate the pipeline on a 6‑B parameter transformer, focusing on concepts like “coreference” and “numerical reasoning”.
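A minimal sketch of the checklist as code. The level ordering follows Pearl’s hierarchy; the mapping from specific methods to levels is an assumption for illustration, not a taxonomy from the paper.

```python
# Which causal level of evidence does each method provide?
# (Illustrative mapping, not the paper's official taxonomy.)
METHOD_LEVEL = {
    "probing_classifier": "association",
    "activation_ranking": "association",
    "ablation": "intervention",
    "activation_patching": "intervention",
    "scm_counterfactual": "counterfactual",
}
LEVELS = ["association", "intervention", "counterfactual"]

def check_claim(claim_level: str, method: str) -> str:
    """Does `method` provide evidence at `claim_level`?"""
    evidence = METHOD_LEVEL[method]
    if LEVELS.index(evidence) >= LEVELS.index(claim_level):
        return "supported"
    return (f"over-reach: {method} gives {evidence}-level evidence; "
            f"a {claim_level}-level claim needs stronger assumptions "
            f"or interventional/SCM data")

print(check_claim("counterfactual", "ablation"))
```

Running the check on a counterfactual claim backed only by ablation flags the over‑reach and names what would be needed to move up the hierarchy.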
Results & Findings
- Associations are easy but fragile – Probing classifiers reliably detect statistical regularities, yet these do not survive distribution shifts (e.g., prompting the model with a different style).
- Interventions give bounded causal insight – Ablation of a “syntax neuron” consistently reduces syntactic accuracy on a held‑out set, confirming an interventional effect. However, the effect size varies with prompt length, indicating limited invariance.
- Counterfactual claims remain unverifiable – The authors attempted to infer “what the model would have output if a latent concept had been different” using CRL‑derived latents. Without an explicit SCM or controlled supervision, the counterfactual predictions diverged sharply from actual model behaviour when tested on a small set of manually crafted prompts.
- CRL improves identifiability under strong assumptions – When the encoder is constrained to respect known modularity (e.g., separate pathways for syntax vs. semantics), the recovered latents align better with the intended concepts, but only when the training data includes targeted interventions.
Overall, the study confirms that interventional evidence is attainable and useful, while counterfactual generalisation requires additional supervision or structural knowledge that most current interpretability pipelines lack.
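The limited‑invariance finding suggests a simple robustness check: measure the same interventional effect across conditions (here, prompt length) rather than at a single operating point. The simulation below uses purely synthetic numbers with an assumed attenuation law; it is not the paper’s data, only a sketch of how the check would look.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: an ablation whose effect size shrinks as longer
# prompts provide competing context. The 1/sqrt(L) law is an assumption
# for illustration, not a measurement from the paper.
def ablation_effect(prompt_len: int, trials: int = 500) -> float:
    base = rng.normal(loc=1.0, scale=0.1, size=trials)
    return float((base / np.sqrt(prompt_len)).mean())

effects = {L: ablation_effect(L) for L in (8, 32, 128)}
print(effects)  # effect size varies with condition -> limited invariance
```

An effect that drifts with the condition is still interventional evidence, but it warns against treating the measured magnitude as an invariant property of the circuit.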
Practical Implications
- Tool selection becomes principled – Developers can now match their interpretability goal (e.g., debugging a specific failure mode) to the appropriate method: use probing for hypothesis generation, interventions for debugging, and avoid counterfactual language unless they have a vetted SCM.
- Safer model auditing – By explicitly stating the causal level of a claim, auditors can avoid over‑promising on “why” a model behaved a certain way, reducing legal and compliance risk.
- Guidance for building explainable APIs – When exposing model explanations to end‑users, services can limit themselves to interventional explanations (e.g., “turning off this feature would change the answer by X%”) which are empirically backed, rather than speculative “if‑then” narratives.
- Roadmap for research tooling – The diagnostic framework can be baked into libraries (e.g., transformers-interp) to automatically flag when a user attempts a counterfactual claim without sufficient data, prompting them to collect intervention data or adjust expectations.
Limitations & Future Work
- Assumption‑heavy CRL – Recovering identifiable latents hinges on strong structural assumptions (independent mechanisms, known modularity) that may not hold for all LLM architectures.
- Scale of experiments – The empirical validation is limited to a single 6‑B model and a handful of concepts; larger models and more diverse tasks could reveal different dynamics.
- Counterfactual estimation remains open – The paper highlights the gap but does not provide a concrete method for building SCMs for LLMs; future work could explore hybrid approaches combining mechanistic circuit analysis with CRL.
- User‑study missing – The practical impact of the diagnostic checklist on real‑world interpretability workflows has not been measured; a user study with ML engineers would strengthen the claim of usability.
Bottom line: By anchoring interpretability research in a well‑established causal framework, this work offers a clear path to more reliable, generalisable explanations for LLMs—provided we respect the limits of what our current tools can actually prove.
Authors
- Shruti Joshi
- Aaron Mueller
- David Klindt
- Wieland Brendel
- Patrik Reizinger
- Dhanya Sridhar
Paper Information
- arXiv ID: 2602.16698v1
- Categories: cs.LG
- Published: February 18, 2026