[Paper] Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

Published: May 8, 2026 at 01:01 PM EDT

Source: arXiv - 2605.08012v1

Overview

The paper “Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims” critiques a growing trend in mechanistic interpretability research: authors routinely frame their findings in causal language (e.g., “circuits,” “mediators,” “causal abstraction”) without stating the identification assumptions under which those causal claims would hold. Based on an audit of ten recent papers, the authors argue that the community lacks a systematic practice for stating identification assumptions, and they propose a concrete disclosure norm to address the gap.

Key Contributions

Systematic audit of 10 mechanistic‑interpretability papers across four methodological families, revealing a consistent absence of dedicated identification‑assumption sections.
Replication audit of a 30‑paper sample, coded by two coders, that confirms the original audit’s findings and shows the result is robust to the coding rules.
Critical analysis of how validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) are mistakenly presented as proof of causality.
Norm proposal: a concise checklist for authors to disclose causal claims, the underlying identification strategy, all required assumptions, and the impact of assumption violations.
Clarification that “validation ≠ identification,” urging the community to treat them as distinct steps in causal inference.

Methodology

Paper selection – The authors curated ten influential mechanistic‑interpretability studies representing four common approaches (circuit analysis, mediator discovery, causal abstraction, and monosemantic probing).
Coding scheme – Two independent human coders examined each paper for:
- Presence of a dedicated “identification assumptions” section.
- Whether validation metrics were used as stand‑alone causal evidence.
- Explicit statement of causal intent.
Replication audit – To test robustness, a second audit of 30 additional papers (selected via keyword search) was performed using the same coding rules. Discrepancies were resolved through discussion, and inter‑coder agreement was reported.
Synthesis – Findings from both audits were aggregated, and patterns were distilled into a proposed disclosure norm.

Results & Findings

None of the papers in the original ten‑paper set contained a separate section that listed identification assumptions.
Validation‑metric substitution was observed in 8/10 papers: authors cited high faithfulness or ablation scores as “evidence of causality” without justifying why those metrics identify the underlying mechanism.
The replication audit (30 papers) showed the same trend: ≈ 85% of papers omitted explicit assumptions, which indicates that the issue is pervasive rather than an artifact of the initial sample.
Inter‑coder reliability was high (Cohen’s κ ≈ 0.78), indicating that the two coders applied the coding scheme consistently.
The authors’ disclosure norm (claim → strategy → assumptions → stress + counterfactual) was shown to be concise (≈ 3‑4 sentences) yet sufficient to make causal reasoning transparent.
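
For readers unfamiliar with the agreement statistic cited above, here is a minimal sketch of how Cohen's κ can be computed for two coders who assign categorical labels to the same items. The function name `cohens_kappa` and the label vectors are illustrative assumptions; they are not the paper's data or code.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items on which the coders give the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up binary codes (1 = "paper has an assumptions section", 0 = "it does not").
coder_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(coder_a, coder_b), 2))  # prints 0.8
```

In this toy example the coders agree on 9 of 10 items (observed agreement 0.9) while chance agreement is 0.5, so κ = (0.9 − 0.5) / (1 − 0.5) = 0.8.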

Practical Implications

For developers building interpretability tools – Knowing the exact assumptions behind a “causal” claim helps you decide whether a tool’s output can be trusted for debugging, safety checks, or model‑editing pipelines.
For AI product teams – The norm gives a checklist for internal review processes, ensuring that any causal explanation presented to stakeholders (e.g., regulators, customers) is backed by a clear identification argument.
For open‑source libraries – Implementers can expose the underlying assumptions as metadata (e.g., explanation.causal_assumptions = [...]), making downstream usage more responsible.
For research reproducibility – Explicit assumption disclosure simplifies replication: other teams can test what happens when an assumption is violated, leading to more robust, generalizable interpretability methods.
For policy and compliance – When interpretability claims are used in audits or compliance reports, the proposed norm provides a defensible way to separate “observational validation” from “causal inference,” which may reduce legal risk.
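
The idea of exposing assumptions as metadata could be sketched as follows. This is a hypothetical design, not an API from the paper or from any existing library: the class names `CausalDisclosure` and `Explanation`, the field names, and the placeholder strings are all assumptions made for illustration. The four fields mirror the checklist items summarized above (claim, identification strategy, assumptions, impact of violations).

```python
from dataclasses import dataclass, field


@dataclass
class CausalDisclosure:
    """Hypothetical record holding the four items of the disclosure checklist."""
    claim: str
    strategy: str
    assumptions: list = field(default_factory=list)
    violation_impact: str = ""

    def missing_items(self):
        """Return the names of checklist items that are still empty."""
        missing = []
        if not self.claim.strip():
            missing.append("claim")
        if not self.strategy.strip():
            missing.append("strategy")
        if not self.assumptions:
            missing.append("assumptions")
        if not self.violation_impact.strip():
            missing.append("violation_impact")
        return missing


@dataclass
class Explanation:
    """Hypothetical tool output that carries its disclosure as metadata."""
    description: str
    disclosure: CausalDisclosure

    @property
    def causal_assumptions(self):
        # Expose a copy so callers cannot mutate the stored disclosure.
        return list(self.disclosure.assumptions)


# Placeholder content only; none of these strings come from the paper.
explanation = Explanation(
    description="placeholder: component X implements behaviour Y",
    disclosure=CausalDisclosure(
        claim="placeholder causal claim",
        strategy="placeholder identification strategy",
        assumptions=["placeholder assumption A", "placeholder assumption B"],
    ),
)
print(explanation.causal_assumptions)
print(explanation.disclosure.missing_items())  # prints ['violation_impact']
```

A review process could refuse to surface an explanation as "causal" while `missing_items()` is non-empty; that policy is a design choice for the adopting team rather than something the summary prescribes.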

Limitations & Future Work

Scope of audit – The study focused on papers that already use causal terminology; it may miss subtler cases where causal language is implicit.
Coding granularity – While two coders achieved good agreement, the binary presence/absence of an assumption section could overlook nuanced discussions embedded elsewhere in the text.
Norm adoption – The paper proposes a disclosure checklist but does not empirically test its uptake or impact on subsequent research quality. Future work could involve a longitudinal study tracking whether journals or conferences adopt the norm and how it changes citation practices.
Tooling support – Developing automated linting or manuscript‑checking tools that flag missing identification assumptions would help operationalize the norm; this remains an open engineering challenge.
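
As a rough illustration of what such a manuscript check might look like, the sketch below flags text that uses causal vocabulary but has no heading for identification assumptions. It is a deliberately naive keyword heuristic under stated assumptions: the function name `lint_manuscript`, the keyword list, and the heading pattern are hypothetical choices, and a real tool would need to handle assumptions discussed outside a dedicated section.

```python
import re

# Hypothetical keyword list, based on the causal vocabulary named in the overview.
CAUSAL_TERMS = ("circuit", "mediator", "causal")

# Matches a line that starts (optionally after '#' marks or a section number)
# with "identification assumptions", e.g. "## Identification Assumptions".
ASSUMPTION_HEADING = re.compile(
    r"^\s*(?:#+\s*|\d+(?:\.\d+)*\.?\s+)?identification[ -]assumptions\b",
    re.IGNORECASE | re.MULTILINE,
)


def lint_manuscript(text):
    """Return warnings for causal language without an assumptions heading."""
    lowered = text.lower()
    used_terms = [term for term in CAUSAL_TERMS if term in lowered]
    has_section = bool(ASSUMPTION_HEADING.search(text))
    warnings = []
    if used_terms and not has_section:
        warnings.append(
            "causal language used ("
            + ", ".join(used_terms)
            + ") but no 'Identification Assumptions' heading found"
        )
    return warnings


draft = "We identify a circuit that mediates induction.\n"
print(lint_manuscript(draft))
fixed = draft + "\n## Identification Assumptions\nListed here.\n"
print(lint_manuscript(fixed))  # prints []
```

Because the check only looks for a heading, it inherits the coding-granularity limitation noted above: a paper that discusses its assumptions inline would still be flagged.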

Authors

Zezheng Lin
Fengming Liu

Paper Information

arXiv ID: 2605.08012v1
Categories: cs.LG, cs.AI, cs.CL
Published: May 8, 2026
