[Paper] Causality is Key for Interpretability Claims to Generalise
Source: arXiv - 2602.16698v1
Overview
Interpretability work on large language models (LLMs) has produced many fascinating insights, but the field still wrestles with two recurring problems: results that don’t hold up when the model or data change, and causal explanations that go beyond what the evidence actually supports. This paper argues that a rigorous causal‑inference framework—specifically Pearl’s causal hierarchy—provides the missing scaffolding for turning activation‑level observations into generalisable interpretability claims.
Key Contributions
- Causal framing of interpretability – Shows how Pearl’s three‑level causal hierarchy (association → intervention → counterfactual) maps onto common interpretability techniques.
- Clarification of what each method can legitimately claim – Distinguishes between associative findings (e.g., “neuron X fires when the model mentions dates”), interventional evidence (e.g., “ablating neuron X reduces the probability of date tokens”), and counterfactual statements (e.g., “if neuron X had been active, the model would have generated a different answer”).
- Operationalisation via Causal Representation Learning (CRL) – Demonstrates how CRL can be used to recover latent causal variables from hidden activations, together with the assumptions required for each recovery.
- Diagnostic framework for practitioners – Provides a checklist that aligns research questions, chosen methods, and evaluation metrics with the appropriate causal level, helping avoid over‑reaching claims.
- Empirical illustration on LLMs – Applies the framework to a set of standard interpretability probes (activation patching, neuron ablation, probing classifiers) and shows where the evidence stops short of supporting counterfactual conclusions.
Methodology
- Mapping interpretability tools onto Pearl’s hierarchy
- Associations: Correlational analyses such as probing classifiers or activation‑ranking.
- Interventions: Controlled edits to the model (ablation, activation patching, weight surgery) and measuring the resulting change in a behavioural metric (e.g., token‑probability shift).
- Counterfactuals: Hypothetical “what‑if” scenarios that require knowledge of the underlying structural causal model (SCM) – typically unavailable for LLMs.
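The first two levels can be made concrete with a toy numerical sketch. This is a stand‑in linear model, not the paper’s experimental setup: association measures how a hidden unit co‑varies with behaviour, while an intervention sets the unit (here, do(h = 0)) and measures the resulting behavioural change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a hidden unit h drives an output logit. (Illustrative
# stand-in for an LLM; not the paper's experimental setup.)
w_in = rng.normal(size=(4, 1))
w_out = rng.normal(size=(1,))

def forward(x, ablate_h=False):
    h = np.tanh(x @ w_in)          # hidden activation
    if ablate_h:
        h = np.zeros_like(h)       # intervention: do(h = 0)
    return float(h @ w_out)        # output logit

X = rng.normal(size=(1000, 4))

# Level 1 (association): does h co-vary with the output?
h = np.tanh(X @ w_in).ravel()
y = np.array([forward(x) for x in X])
corr = np.corrcoef(h, y)[0, 1]

# Level 2 (intervention): ablate h and measure the behavioural change.
y_do = np.array([forward(x, ablate_h=True) for x in X])
effect = np.mean(np.abs(y - y_do))

print(f"association (corr): {corr:.2f}, interventional effect: {effect:.3f}")
```

In this toy case the association happens to be causal, which is exactly what cannot be assumed for a real LLM: only the interventional measurement licenses the level‑2 claim.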
- Causal Representation Learning (CRL) pipeline
- Define a set of latent variables that are hypothesised to capture high‑level concepts (e.g., “sentiment”, “syntax”).
- Train an encoder that maps hidden activations to these latents while enforcing identifiability constraints (e.g., independent mechanisms, non‑Gaussian noise).
- Validate the learned latents by performing interventional experiments (e.g., intervene on a latent and observe downstream token changes).
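The three pipeline steps can be sketched end to end on synthetic data. Everything below is an assumption for illustration (two named concept latents, a linear mixing into activations, and least‑squares fitting as a stand‑in for a CRL objective with interventional supervision); it is not the paper’s method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: activations are a linear mix of two ground-truth
# concept latents (say, "sentiment" and "syntax"). Names and shapes are
# illustrative only.
n, d_act, d_lat = 2000, 8, 2
z_true = rng.normal(size=(n, d_lat))        # latent concepts
mix = rng.normal(size=(d_lat, d_act))       # mixing into activations
acts = z_true @ mix

# Steps 1-2: fit an encoder. With interventional supervision available,
# least squares serves as a stand-in for a CRL training objective.
enc, *_ = np.linalg.lstsq(acts, z_true, rcond=None)
z_hat = acts @ enc

# Identifiability check: recovered latents should align with the
# intended concepts (cross-correlation diagonal near 1).
align = np.corrcoef(z_hat.T, z_true.T)[:d_lat, d_lat:]

# Step 3: intervene on latent 0, decode back, and observe the induced
# change in activation space.
dec = np.linalg.pinv(enc)
z_int = z_hat.copy()
z_int[:, 0] = 0.0                           # do(z_0 = 0)
delta = (z_int - z_hat) @ dec               # induced activation change

print("alignment diag:", np.round(np.diag(align), 3))
print("mean |activation shift|:", float(np.abs(delta).mean()))
```

The recovery succeeds here precisely because the synthetic data satisfies the identifiability assumptions; the paper’s point is that such assumptions must be stated and checked, not presumed.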
- Diagnostic checklist
- For a given claim, the checklist asks:
a) Which causal level does the claim belong to?
b) Does the chosen method provide evidence at that level?
c) What additional data or assumptions (e.g., access to ground‑truth interventions) are needed to move up the hierarchy?
The authors illustrate the pipeline on a 6‑B parameter transformer, focusing on concepts like “coreference” and “numerical reasoning”.
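A minimal sketch of the checklist as code. The level ordering follows Pearl’s hierarchy; the mapping from specific methods to levels is an assumption for illustration, not a taxonomy from the paper.

```python
# Which causal level of evidence does each method provide?
# (Illustrative mapping, not the paper's official taxonomy.)
METHOD_LEVEL = {
    "probing_classifier": "association",
    "activation_ranking": "association",
    "ablation": "intervention",
    "activation_patching": "intervention",
    "scm_counterfactual": "counterfactual",
}
LEVELS = ["association", "intervention", "counterfactual"]

def check_claim(claim_level: str, method: str) -> str:
    """Does `method` provide evidence at `claim_level`?"""
    evidence = METHOD_LEVEL[method]
    if LEVELS.index(evidence) >= LEVELS.index(claim_level):
        return "supported"
    return (f"over-reach: {method} gives {evidence}-level evidence; "
            f"a {claim_level}-level claim needs stronger assumptions "
            f"or interventional/SCM data")

print(check_claim("counterfactual", "ablation"))
```

Running the check on a counterfactual claim backed only by ablation flags the over‑reach and names what would be needed to move up the hierarchy.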
Results & Findings
- Associations are easy but fragile – Probing classifiers reliably detect statistical regularities, yet these do not survive distribution shifts (e.g., prompting the model with a different style).
- Interventions give bounded causal insight – Ablation of a “syntax neuron” consistently reduces syntactic accuracy on a held‑out set, confirming an interventional effect. However, the effect size varies with prompt length, indicating limited invariance.
- Counterfactual claims remain unverifiable – The authors attempted to infer “what the model would have output if a latent concept had been different” using CRL‑derived latents. Without an explicit SCM or controlled supervision, the counterfactual predictions diverged sharply from actual model behaviour when tested on a small set of manually crafted prompts.
- CRL improves identifiability under strong assumptions – When the encoder is constrained to respect known modularity (e.g., separate pathways for syntax vs. semantics), the recovered latents align better with the intended concepts, but only when the training data includes targeted interventions.
Overall, the study confirms that interventional evidence is attainable and useful, while counterfactual generalisation requires additional supervision or structural knowledge that most current interpretability pipelines lack.
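The limited‑invariance finding suggests a simple robustness check: measure the same interventional effect across conditions (here, prompt length) rather than at a single operating point. The simulation below uses purely synthetic numbers with an assumed attenuation law; it is not the paper’s data, only a sketch of how the check would look.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: an ablation whose effect size shrinks as longer
# prompts provide competing context. The 1/sqrt(L) law is an assumption
# for illustration, not a measurement from the paper.
def ablation_effect(prompt_len: int, trials: int = 500) -> float:
    base = rng.normal(loc=1.0, scale=0.1, size=trials)
    return float((base / np.sqrt(prompt_len)).mean())

effects = {L: ablation_effect(L) for L in (8, 32, 128)}
print(effects)  # effect size varies with condition -> limited invariance
```

An effect that drifts with the condition is still interventional evidence, but it warns against treating the measured magnitude as an invariant property of the circuit.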
Practical Implications
- Tool selection becomes principled – Developers can now match their interpretability goal (e.g., debugging a specific failure mode) to the appropriate method: use probing for hypothesis generation, interventions for debugging, and avoid counterfactual language unless they have a vetted SCM.
- Safer model auditing – By explicitly stating the causal level of a claim, auditors can avoid over‑promising on “why” a model behaved a certain way, reducing legal and compliance risk.
- Guidance for building explainable APIs – When exposing model explanations to end‑users, services can limit themselves to interventional explanations (e.g., “turning off this feature would change the answer by X%”) which are empirically backed, rather than speculative “if‑then” narratives.
- Roadmap for research tooling – The diagnostic framework can be baked into libraries (e.g., transformers-interp) to automatically flag when a user attempts a counterfactual claim without sufficient data, prompting them to collect intervention data or adjust expectations.
Limitations & Future Work
- Assumption‑heavy CRL – Recovering identifiable latents hinges on strong structural assumptions (independent mechanisms, known modularity) that may not hold for all LLM architectures.
- Scale of experiments – The empirical validation is limited to a single 6‑B model and a handful of concepts; larger models and more diverse tasks could reveal different dynamics.
- Counterfactual estimation remains open – The paper highlights the gap but does not provide a concrete method for building SCMs for LLMs; future work could explore hybrid approaches combining mechanistic circuit analysis with CRL.
- User‑study missing – The practical impact of the diagnostic checklist on real‑world interpretability workflows has not been measured; a user study with ML engineers would strengthen the claim of usability.
Bottom line: By anchoring interpretability research in a well‑established causal framework, this work offers a clear path to more reliable, generalisable explanations for LLMs—provided we respect the limits of what our current tools can actually prove.
Authors
- Shruti Joshi
- Aaron Mueller
- David Klindt
- Wieland Brendel
- Patrik Reizinger
- Dhanya Sridhar
Paper Information
- arXiv ID: 2602.16698v1
- Categories: cs.LG
- Published: February 18, 2026