[Paper] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Published: January 15, 2026 at 01:54 PM EST
4 min read
Source: arXiv - 2601.10700v1

Overview

The paper introduces LIBERTy, a new benchmarking framework that uses structural counterfactuals—synthetically generated “what‑if” text examples—to evaluate how faithfully concept‑based explanations capture the causal influence of high‑level attributes (e.g., gender, disease status) on the predictions of large language models (LLMs). By automating the creation of counterfactual pairs through explicit causal graphs, the authors provide a scalable, reproducible way to test explainability methods without relying on expensive human‑written edits.

Key Contributions

  • LIBERTy framework: A systematic pipeline that builds Structural Causal Models (SCMs) of text generation and automatically produces intervention‑driven counterfactuals.
  • Three domain‑specific datasets:
    1. Disease detection from clinical notes
    2. Computer‑vision (CV) screening reports (e.g., radiology)
    3. Workplace‑violence risk prediction
  • Order‑faithfulness metric: A novel evaluation that checks whether an explanation correctly ranks concepts by their true causal impact, rather than just matching absolute effect sizes (a minimal sketch of one way to compute such a score follows this list).
  • Comprehensive benchmark: Evaluation of dozens of concept‑based explanation methods across five LLMs (including proprietary models) reveals large gaps between current performance and the theoretical optimum.
  • Sensitivity analysis: Demonstrates that many commercial LLMs are markedly less responsive to demographic concepts, suggesting the presence of post‑training mitigation strategies.
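
The snippet below is a minimal sketch of how an order-faithfulness score could be computed. The pairwise rank-agreement formulation, the function name, and the toy effect values are illustrative assumptions, not the paper's exact definition.

```python
from itertools import combinations


def order_faithfulness(true_effects: dict[str, float],
                       explained_effects: dict[str, float]) -> float:
    """Fraction of concept pairs whose relative order matches the ground truth.

    Illustrative stand-in for the paper's metric: it rewards correct
    rankings and ignores absolute effect sizes.
    """
    agree, total = 0, 0
    for a, b in combinations(sorted(true_effects), 2):
        true_diff = true_effects[a] - true_effects[b]
        expl_diff = explained_effects[a] - explained_effects[b]
        if true_diff == 0:  # skip pairs tied in the ground truth
            continue
        total += 1
        agree += true_diff * expl_diff > 0
    return agree / total if total else float("nan")


# Hypothetical ground-truth vs. estimated effects for three concepts.
truth = {"disease": 0.62, "gender": 0.20, "age": 0.05}
estimate = {"disease": 0.50, "gender": 0.03, "age": 0.10}
print(order_faithfulness(truth, estimate))  # 2 of 3 pairs agree -> ~0.67
```

A rank-correlation coefficient such as Kendall's tau would behave similarly; the key property is that only the ordering of concepts, not the magnitude of their effects, is rewarded.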

Methodology

  1. Define an SCM for each task – Model the generative process of a text (e.g., a clinical note) as a directed graph where nodes represent latent variables (disease, patient age, gender, etc.) and edges encode causal relationships.
  2. Intervene on a concept – To test a concept C, replace its value in the SCM (e.g., flip gender from “female” to “male”) while keeping everything else unchanged.
  3. Propagate the intervention – The altered node triggers downstream changes (e.g., symptom description, risk scores) according to the SCM’s functional equations.
  4. Generate the counterfactual text – Prompt an LLM with the modified latent variables, producing a new piece of text that reflects the intervention. This yields a paired dataset: original vs. counterfactual.
  5. Estimate ground‑truth causal effects – Compare model predictions on the two texts to obtain a reference causal effect for each concept.
  6. Evaluate explanations – Run existing concept‑based explanation methods (e.g., probing classifiers, attention‑based scores, gradient‑based attributions) on the original text and compare their estimated effects to the reference using the order‑faithfulness metric.

The whole pipeline is fully automated, requiring only the specification of the SCM and a set of concept variables.
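
For concreteness, here is a minimal sketch of that loop for the clinical-note task. The toy SCM, the variable names, and the `llm_generate` / `classifier_predict` stand-ins are assumptions made for illustration, not the authors' code or prompts.

```python
import random
from typing import Callable


def sample_latents(rng: random.Random) -> dict:
    """Step 1: a toy SCM in which symptoms depend causally on the disease."""
    disease = rng.choice(["pneumonia", "none"])
    return {
        "gender": rng.choice(["female", "male"]),
        "disease": disease,
        "symptoms": "fever and productive cough" if disease == "pneumonia" else "no acute complaints",
    }


def intervene(latents: dict, concept: str, value: str) -> dict:
    """Steps 2-3: set one concept and re-propagate its downstream nodes."""
    cf = dict(latents)
    cf[concept] = value
    if concept == "disease":  # the symptoms node is a child of disease
        cf["symptoms"] = "fever and productive cough" if value == "pneumonia" else "no acute complaints"
    return cf


def render_note(latents: dict, llm_generate: Callable[[str], str]) -> str:
    """Step 4: prompt an LLM (stand-in callable) to verbalize the latent variables."""
    prompt = ("Write a short clinical note for a {gender} patient presenting with "
              "{symptoms} (diagnosis: {disease}).").format(**latents)
    return llm_generate(prompt)


def reference_effect(original_note: str, counterfactual_note: str,
                     classifier_predict: Callable[[str], float]) -> float:
    """Step 5: the ground-truth causal effect is the prediction shift across the pair."""
    return classifier_predict(counterfactual_note) - classifier_predict(original_note)
```

Looping `intervene` over every concept, rendering each pair of notes, and averaging `reference_effect` over many sampled latents yields per-concept ground-truth effects; step 6 then scores each explanation method by how well its estimates reproduce that ordering.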

Results & Findings

  • Performance gap: Even the best‑performing explanation methods achieve only ~55 % order‑faithfulness, far from the 100 % ceiling, indicating substantial room for improvement.
  • Model‑specific behavior: Open‑source LLMs (e.g., LLaMA, Falcon) show higher sensitivity to demographic concepts than closed‑source commercial models (e.g., GPT‑4), which often dampen the effect of gender or ethnicity.
  • Concept difficulty: Clinical disease concepts are easier to capture than nuanced social concepts (e.g., workplace violence triggers), suggesting that the granularity of the SCM matters.
  • Method ranking: Gradient‑based attribution methods generally outperform simple attention‑weight heuristics, but probing classifiers remain competitive when fine‑tuned on the target domain.
  • Robustness to noise: Introducing stochasticity in the SCM (e.g., random symptom phrasing) only modestly degrades explanation quality, confirming that LIBERTy’s counterfactuals are resilient to linguistic variation.

Practical Implications

  • Better debugging tools: Developers can use LIBERTy to stress‑test their LLM‑based pipelines (e.g., triage bots, automated report generators) and spot hidden biases before deployment.
  • Regulatory compliance: The framework supplies a quantifiable, auditable measure of explanation faithfulness that aligns with emerging AI‑risk regulations (e.g., EU AI Act).
  • Model selection: Companies can compare proprietary and open‑source LLMs not just on accuracy but on how transparently they expose concept influences, informing procurement decisions.
  • Guiding mitigation: By revealing which concepts a model is overly sensitive to, LIBERTy can direct targeted post‑training interventions (e.g., fine‑tuning, prompt engineering) to reduce unwanted bias.
  • Accelerating research: The publicly released datasets and code lower the barrier for new explainability methods, fostering rapid iteration and community‑wide standards.

Limitations & Future Work

  • SCM fidelity: The quality of counterfactuals hinges on how accurately the handcrafted SCM mirrors real‑world causal relations; oversimplified graphs may miss hidden confounders.
  • Domain coverage: LIBERTy currently focuses on three domains; extending to conversational agents, code generation, or multilingual settings will test the framework’s generality.
  • Human validation: The synthetic counterfactual texts have not been exhaustively vetted by domain experts for clinical realism, which could affect downstream trust.
  • Scalability to very large models: Generating counterfactuals for multi‑billion‑parameter LLMs incurs non‑trivial compute costs; future work could explore more efficient intervention strategies.

Overall, LIBERTy marks a significant step toward rigorous, scalable evaluation of concept‑based explanations for LLMs, offering developers a practical tool to build more transparent and trustworthy AI systems.

Authors

  • Gilat Toker
  • Nitay Calderon
  • Ohad Amosy
  • Roi Reichart

Paper Information

  • arXiv ID: 2601.10700v1
  • Categories: cs.CL, cs.AI
  • Published: January 15, 2026