[Paper] Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

Published: May 8, 2026 at 01:01 PM EDT

Source: arXiv - 2605.08012v1

Overview

The paper “Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims” critiques a growing trend in mechanistic interpretability research: authors routinely frame their findings in causal language (e.g., “circuits,” “mediators,” “causal abstraction”) without stating the identification assumptions under which those causal claims would hold. Based on an audit of ten recent papers, the authors argue that the community lacks a systematic practice for stating identification assumptions, and they propose a concrete disclosure norm to address the gap.

Key Contributions

  • Systematic audit of 10 mechanistic‑interpretability papers across four methodological families, revealing a consistent absence of dedicated identification‑assumption sections.
  • Replication audit of a 30‑paper sample by two coders, which confirms the original audit’s findings and indicates that the result is robust to the coding rules.
  • Critical analysis of how validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) are mistakenly presented as proof of causality.
  • Norm proposal: a concise checklist for authors to disclose causal claims, the underlying identification strategy, all required assumptions, and the impact of assumption violations.
  • Clarification that “validation ≠ identification,” urging the community to treat them as distinct steps in causal inference.

Methodology

  1. Paper selection – The authors curated ten influential mechanistic‑interpretability studies representing four common approaches (circuit analysis, mediator discovery, causal abstraction, and monosemantic probing).
  2. Coding scheme – Two independent human coders examined each paper for:
    • Presence of a dedicated “identification assumptions” section.
    • Whether validation metrics were used as stand‑alone causal evidence.
    • Explicit statement of causal intent.
  3. Replication audit – To test robustness, a second audit of 30 additional papers (selected via keyword search) was performed using the same coding rules. Discrepancies were resolved through discussion, and inter‑coder agreement was reported.
  4. Synthesis – Findings from both audits were aggregated, and patterns were distilled into a proposed disclosure norm.

Results & Findings

  • Zero papers in the original ten‑paper set contained a separate section that listed identification assumptions.
  • Validation‑metric substitution was observed in 8 of the 10 papers: authors cited high faithfulness or ablation scores as “evidence of causality” without justifying why those metrics identify the underlying mechanism.
  • The replication audit (30 papers) showed the same trend (≈ 85% of papers omitted explicit assumptions), suggesting that the issue is pervasive rather than an artifact of the initial sample.
  • Inter‑coder reliability was high (Cohen’s κ ≈ 0.78), indicating that the coding scheme reliably captured the phenomenon.
  • The authors’ disclosure norm (claim → strategy → assumptions → stress + counterfactual) is presented as concise (≈ 3‑4 sentences) yet sufficient to make the causal reasoning transparent.
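
For readers unfamiliar with the agreement statistic reported above, the following is a minimal Python sketch of how Cohen’s κ is computed for two coders. The function name cohens_kappa and the label lists are illustrative assumptions; the labels are made up and are not the paper’s coding data.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both coders use one identical label;
        # returning 1.0 here is a choice made for this sketch only.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels (1 = "assumption section present", 0 = "absent").
# These are NOT the paper's data.
coder_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
coder_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

In this toy example the coders agree on 9 of 10 items and chance agreement is 0.5, so κ = (0.9 − 0.5) / (1 − 0.5) = 0.8.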

Practical Implications

  • For developers building interpretability tools – Knowing the exact assumptions behind a “causal” claim helps you decide whether a tool’s output can be trusted for debugging, safety checks, or model‑editing pipelines.
  • For AI product teams – The norm gives a checklist for internal review processes, ensuring that any causal explanation presented to stakeholders (e.g., regulators, customers) is backed by a clear identification argument.
  • For open‑source libraries – Implementers can expose the underlying assumptions as metadata (e.g., explanation.causal_assumptions = [...]), making downstream usage more responsible.
  • For research reproducibility – Explicit assumption disclosure simplifies replication: other teams can test what happens when an assumption is violated, leading to more robust, generalizable interpretability methods.
  • For policy and compliance – When interpretability claims are used in audits or compliance reports, the proposed norm provides a defensible way to separate “observational validation” from “causal inference,” reducing legal risk.
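
The metadata idea mentioned for open‑source libraries could be sketched roughly as follows. This is a hypothetical design, not an API from the paper or from any existing library: the class names CausalDisclosure and Explanation, the field names, and the example strings are all assumptions chosen to mirror the claim → strategy → assumptions → violation‑impact structure of the proposed norm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CausalDisclosure:
    """Hypothetical container for the four parts of the disclosure norm."""
    claim: str                                   # the causal claim being made
    strategy: str                                # how the claim is identified
    assumptions: List[str] = field(default_factory=list)
    if_violated: str = ""                        # what changes if assumptions fail

    def missing_fields(self) -> List[str]:
        """Names of disclosure parts that were left empty."""
        checks = {
            "claim": self.claim.strip(),
            "strategy": self.strategy.strip(),
            "assumptions": self.assumptions,
            "if_violated": self.if_violated.strip(),
        }
        return [name for name, value in checks.items() if not value]


@dataclass
class Explanation:
    """Hypothetical explanation object that carries its disclosure along."""
    summary: str
    disclosure: Optional[CausalDisclosure] = None

    @property
    def causal_assumptions(self) -> List[str]:
        # Empty list when no disclosure was attached.
        return list(self.disclosure.assumptions) if self.disclosure else []


# Illustrative content only; not taken from any audited paper.
disclosure = CausalDisclosure(
    claim="Component C mediates behaviour B on task T",
    strategy="activation patching on matched prompt pairs",
    assumptions=["patched activations stay on-distribution",
                 "no alternative pathway compensates for the patch"],
    if_violated="the measured effect may not reflect the model's mechanism",
)
explanation = Explanation(summary="C drives B", disclosure=disclosure)
print(explanation.causal_assumptions)
print(disclosure.missing_fields())
```

A review process could refuse to surface an explanation whose missing_fields() list is non‑empty, which keeps the check mechanical while leaving the quality of the stated assumptions to human judgment.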

Limitations & Future Work

  • Scope of audit – The study focused on papers that already use causal terminology; it may miss subtler cases where causal language is implicit.
  • Coding granularity – While two coders achieved good agreement, the binary presence/absence of an assumption section could overlook nuanced discussions embedded elsewhere in the text.
  • Norm adoption – The paper proposes a disclosure checklist but does not empirically test its uptake or impact on subsequent research quality. Future work could involve a longitudinal study tracking whether journals or conferences adopt the norm and how it changes citation practices.
  • Tooling support – Developing automated linting or manuscript‑checking tools that flag missing identification assumptions would help operationalize the norm; this remains an open engineering challenge.
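
As a rough illustration of the tooling idea, the sketch below flags text that uses causal vocabulary but never mentions identification assumptions. It is a deliberately naive keyword heuristic, not a tool described in the paper: the function name lint_manuscript and both regular expressions are assumptions, and a real checker would need to handle paraphrase, section structure, and false positives.

```python
import re

# Hypothetical keyword lists; a real tool would need a broader vocabulary.
CAUSAL_TERMS = re.compile(r"\b(caus\w*|mediat\w*|circuit\w*)\b", re.IGNORECASE)
DISCLOSURE = re.compile(r"\bidentif\w+[\s-]+assumptions?\b", re.IGNORECASE)


def lint_manuscript(text):
    """Return warnings for causal language with no stated assumptions."""
    causal_hits = sorted({m.group(0).lower() for m in CAUSAL_TERMS.finditer(text)})
    if causal_hits and not DISCLOSURE.search(text):
        return [
            f"causal terms {causal_hits} used, "
            "but no identification assumptions stated"
        ]
    return []


# Made-up sentences for illustration.
flagged = lint_manuscript("We find a circuit that causally mediates the behaviour.")
clean = lint_manuscript(
    "We find a circuit. Identification assumptions: the patch is on-distribution."
)
print(flagged)
print(clean)
```

The heuristic only detects whether the phrase appears at all, so it can support the norm as a reminder but cannot judge whether the stated assumptions are adequate.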

Authors

  • Zezheng Lin
  • Fengming Liu

Paper Information

  • arXiv ID: 2605.08012v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: May 8, 2026