[Paper] Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification
Source: arXiv - 2602.24266v1
Overview
The paper proposes a new way to uncover high‑level causal explanations hidden inside trained neural networks without costly retraining or exhaustive intervention experiments. By treating pruning as a search for an approximate causal abstraction, the authors derive a principled, fast method that extracts a sparse, intervention‑faithful structural causal model (SCM) from any deterministic network.
Key Contributions
- Reframing abstraction discovery as a structured pruning problem, linking model compression to causal analysis.
- Derivation of an Interventional Risk objective that quantifies how well a pruned network preserves the effects of interventions.
- Closed‑form second‑order expansion that yields simple criteria for (a) fixing a unit to a constant and (b) merging a unit into its neighbors.
- Demonstration that, under uniform curvature, the score collapses to activation variance, which both justifies variance-based pruning and exposes its limits (see the sketch after this list).
- An efficient algorithm that extracts sparse, intervention‑faithful abstractions from pretrained models, validated with interchange‑intervention experiments.
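To make the second-order machinery concrete, here is a rough sketch in notation invented for this summary (the paper's exact symbols and derivation may differ): write f for the network output, a_i for the activation of unit i, and consider fixing a_i to a constant c.

```latex
% Interventional Risk of fixing unit i to the constant c (illustrative notation):
R_i(c) \;=\; \mathbb{E}_x\Big[\big(f(a) - f(a \,;\, a_i := c)\big)^2\Big]
\;\approx\; \mathbb{E}_x\Big[\big(g_i\,(a_i - c) + \tfrac{1}{2}\,h_i\,(a_i - c)^2\big)^2\Big],
\qquad g_i = \frac{\partial f}{\partial a_i}, \quad h_i = \frac{\partial^2 f}{\partial a_i^2}.
```

Taking c = E[a_i], the dominant contributions scale with Var[a_i] weighted by the unit's sensitivity g_i and curvature h_i; assuming those weights are uniform across units, the ranking reduces to plain activation variance, which is both the justification and the limitation referred to above.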
Methodology
- Treat the trained network as a deterministic SCM: each neuron is a variable, and the forward pass defines the functional relationships between them.
- Define Interventional Risk: the expected discrepancy between the original network’s output under an intervention and the output of a candidate abstracted model under the same intervention.
- Second‑order Taylor expansion of this risk yields a tractable expression involving the curvature (second derivatives) of the network’s functions.
- Pruning decisions:
  - Constant replacement: a unit can be set to a fixed value if its contribution to the risk (a function of its activation variance and local curvature) is low.
  - Folding: a unit can be merged into a neighboring unit if the combined risk remains small.
- Uniform curvature assumption simplifies the score to activation variance, connecting to classic magnitude‑based pruning.
- Iterative search: repeatedly apply the above criteria to produce a sparse abstraction, stopping when a user-specified risk budget is reached (a minimal code sketch follows this list).
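A minimal sketch of this search, assuming a toy two-layer network and constant replacement only (folding is omitted); the helper names `forward`, `interventional_risk`, `sparsify`, and the `budget` value are invented here for illustration. It estimates the risk by Monte Carlo over a sample batch rather than with the paper's closed-form second-order score, so it shows the search structure, not the fast criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic network: x -> h = tanh(x @ W1.T) -> y = h @ W2.T.
# Viewed as an SCM, each hidden unit h_j is a variable whose mechanism
# is defined by the forward pass.
W1 = rng.normal(size=(16, 8)) / np.sqrt(8)
W2 = rng.normal(size=(4, 16)) / np.sqrt(16)

def forward(x, fixed=None):
    """Forward pass; `fixed` maps hidden-unit index -> constant,
    i.e. a do(h_j := c) intervention implementing constant replacement."""
    h = np.tanh(x @ W1.T)
    for j, c in (fixed or {}).items():
        h[:, j] = c
    return h @ W2.T

def interventional_risk(x, fixed):
    """Monte Carlo estimate of the expected squared output discrepancy
    between the full network and the candidate abstraction."""
    return float(np.mean((forward(x) - forward(x, fixed)) ** 2))

def sparsify(x, budget=1e-2):
    """Greedily fix the unit whose replacement by its mean activation
    keeps total risk lowest; stop before the budget would be exceeded."""
    means = np.tanh(x @ W1.T).mean(axis=0)
    fixed = {}
    while len(fixed) < W1.shape[0]:
        scores = {j: interventional_risk(x, {**fixed, j: means[j]})
                  for j in range(W1.shape[0]) if j not in fixed}
        best = min(scores, key=scores.get)
        if scores[best] > budget:
            break  # the cheapest remaining removal exceeds the risk budget
        fixed[best] = means[best]
    return fixed

x = rng.normal(size=(256, 8))  # samples standing in for the data distribution
abstraction = sparsify(x)
print(f"fixed {len(abstraction)}/{W1.shape[0]} units, "
      f"residual risk = {interventional_risk(x, abstraction):.2e}")
```

In the paper's setting the per-unit scores come from the closed-form expansion (curvature-weighted variance) rather than repeated forward passes, which is what buys the speedup over search loops like the one above.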
Results & Findings
- On standard vision benchmarks (e.g., CIFAR-10/100), the method reduces network size by 70-90% while keeping interventional fidelity above 95%, measured via interchange interventions (see the sketch after this list).
- Compared to brute‑force interchange‑intervention search, the proposed approach achieves orders‑of‑magnitude speedups (minutes vs. hours).
- When curvature is non‑uniform, variance‑only pruning fails to preserve causal behavior, whereas the curvature‑aware score maintains fidelity, confirming the theoretical analysis.
- The extracted abstractions often align with human-interpretable concepts (e.g., edge detectors, texture filters), suggesting the method surfaces meaningful causal mechanisms.
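For readers unfamiliar with the fidelity metric: an interchange intervention copies a unit's activation recorded on a source input into a forward pass on a base input, applies the same patch to both the original network and the abstraction, and checks whether their outputs still agree. A minimal sketch under assumed details (the toy network, tolerance `tol`, and agreement criterion are illustrative choices, not the paper's protocol):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8)) / np.sqrt(8)
W2 = rng.normal(size=(4, 16)) / np.sqrt(16)

def forward(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> replacement activations."""
    h = np.tanh(x @ W1.T)
    for j, v in (patch or {}).items():
        h[:, j] = v
    return h @ W2.T

def interchange_fidelity(abstr_forward, base, source, units, tol=1e-2):
    """Fraction of interchange interventions under which the abstraction's
    output stays within `tol` of the original network's output."""
    h_src = np.tanh(source @ W1.T)  # activations recorded on the source inputs
    hits = 0
    for j in units:
        patch = {j: h_src[:, j]}             # splice unit j's source activation
        y_full = forward(base, patch)        # original network under the patch
        y_abst = abstr_forward(base, patch)  # abstraction under the same patch
        hits += np.mean(np.abs(y_full - y_abst)) < tol
    return hits / len(units)

# Hypothetical abstraction: units 0 and 5 held at constants; an interchange
# patch on a fixed unit overrides the constant (interventions take precedence).
fixed = {0: 0.0, 5: 0.0}
abstr = lambda x, patch=None: forward(x, {**fixed, **(patch or {})})

base, source = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
print(f"fidelity = {interchange_fidelity(abstr, base, source, range(16)):.2f}")
```

A faithful abstraction scores near 1.0 on this metric; the random toy network above only demonstrates how the measurement is taken.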
Practical Implications
- Model debugging & safety: Developers can quickly obtain a causal map of a network to understand how interventions (e.g., feature masking) affect predictions, aiding in root‑cause analysis of failures.
- Efficient deployment: The sparse abstractions can serve as lightweight surrogates for inference in resource-constrained environments while keeping key causal relationships intact up to the chosen risk budget.
- Explainable AI tooling: The algorithm can be integrated into existing ML pipelines to generate post‑hoc explanations that are faithful under counterfactual queries, a step beyond gradient‑based saliency.
- Transfer learning: Abstracted causal modules can be reused across tasks, potentially reducing the data and compute needed for fine‑tuning.
Limitations & Future Work
- The current theory assumes deterministic networks; stochastic layers (e.g., dropout, Bayesian neural networks) are not directly handled.
- The uniform curvature simplification may not hold for highly non‑linear architectures (e.g., transformers), limiting the variance‑only pruning shortcut.
- Experiments focus on image classification; extending validation to NLP or reinforcement learning domains remains open.
- Future research could explore adaptive curvature estimation, incorporate causal discovery from data (instead of a given network), and investigate interactive tools for developers to query and edit the extracted abstractions.
Authors
- Amir Asiaee
Paper Information
- arXiv ID: 2602.24266v1
- Categories: cs.LG, cs.AI
- Published: February 27, 2026