[Paper] From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
Source: arXiv - 2601.02224v1
Overview
The paper investigates how to turn raw, numeric explanations from XAI tools (like SHAP and LIME) into human‑readable stories using Large Language Models (LLMs). By systematically varying the forecasting model, XAI method, LLM, and prompting style, the authors uncover which ingredients actually matter for producing high‑quality natural‑language explanations (NLEs) in a time‑series forecasting context.
Key Contributions
- Factorial experimental design covering 4 forecasting models, 3 XAI conditions (SHAP, LIME, and a no‑XAI baseline), 3 LLMs, and 8 prompting strategies, yielding 660 generated explanations in total.
- LLM‑as‑judge evaluation using G‑Eval with two independent LLM judges and four quality criteria (faithfulness, completeness, clarity, and usefulness).
- Empirical finding that the choice of LLM outweighs all other factors, with DeepSeek‑R1 consistently outperforming GPT‑4o and Llama‑3‑8B.
- Evidence that classic XAI methods add only marginal value for non‑expert users and can even be unnecessary when a strong LLM is used.
- Discovery of an “interpretability paradox”: a more accurate classical model (SARIMAX) yields poorer NLEs than black‑box ML models.
- Prompting insights: zero‑shot prompts match the quality of more expensive self‑consistency prompting, while chain‑of‑thought (CoT) degrades explanation quality.
Methodology
- Forecasting models – Four models were trained on a standard time‑series dataset:
  - XGBoost (XGB)
  - Random Forest (RF)
  - Multilayer Perceptron (MLP)
  - SARIMAX (a statistical time‑series model)
- XAI conditions – For each forecast, explanations were generated under one of three conditions:
  - SHAP
  - LIME
  - No‑XAI (raw prediction only)
- LLM generators – The numeric attributions (or raw predictions) were fed to three LLMs:
  - GPT‑4o (OpenAI)
  - Llama‑3‑8B (Meta)
  - DeepSeek‑R1 (DeepSeek)
- Prompting strategies – Eight variants, ranging from simple zero‑shot prompts to self‑consistency (aggregating multiple sampled answers) and chain‑of‑thought (CoT) prompts; a sketch of the generation step under these styles follows this list.
- Evaluation – Using G‑Eval, two LLM judges independently scored each explanation on four criteria:
  - Faithfulness (does it reflect the underlying attribution?)
  - Completeness (does it cover all important features?)
  - Clarity (is it readable for the target audience?)
  - Usefulness (is it actionable for the user?)
Scores were aggregated to produce an overall quality metric for each of the 660 explanations.
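The paper does not include code, so the sketch below is only a rough illustration of the generation step, assuming an XGBoost forecaster, the shap Python library, and an OpenAI‑compatible client. The prompt wording, helper names, and temperature settings are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: numeric SHAP attributions -> natural-language explanation.
# The library calls (shap.TreeExplainer, OpenAI chat completions) are real APIs;
# the prompt text and helper names are illustrative, not the paper's.
import shap
from openai import OpenAI

client = OpenAI()

def top_attributions(model, X_row, k=5):
    """Return the k features with the largest absolute SHAP value for one forecast."""
    explainer = shap.TreeExplainer(model)        # suitable for tree models such as XGBoost
    values = explainer.shap_values(X_row)[0]     # per-feature attribution for this row
    ranked = sorted(zip(X_row.columns, values), key=lambda p: abs(p[1]), reverse=True)
    return ranked[:k]

def build_prompt(prediction, attributions, style="zero_shot"):
    """Assemble a prompt from the forecast and its attributions."""
    facts = "; ".join(f"{name}: {val:+.3f}" for name, val in attributions)
    prompt = (f"The model forecasts {prediction:.2f}. "
              f"Feature attributions (SHAP): {facts}. "
              "Explain this forecast to a non-expert in plain language.")
    if style == "cot":
        prompt += " Think step by step before answering."
    return prompt

def explain(prediction, attributions, style="zero_shot", n_samples=1):
    """Zero-shot/CoT use one sample; self-consistency draws several samples."""
    prompt = build_prompt(prediction, attributions, style)
    replies = [
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7 if n_samples > 1 else 0.0,
        ).choices[0].message.content
        for _ in range(n_samples)
    ]
    return replies[0] if n_samples == 1 else replies  # aggregation of samples omitted
```

In this reading, self‑consistency simply draws several sampled answers per explanation, which is consistent with the roughly 7× cost difference reported in the results.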
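For the evaluation step, a comparable minimal sketch of LLM‑as‑judge scoring is given below. The judge model names, the 1–5 scale, and the criterion prompts are simplified placeholders; the paper follows the G‑Eval protocol with its own rubric and judge pair.

```python
# Hypothetical sketch of LLM-as-judge scoring: two judges rate each explanation
# on four criteria, and the scores are averaged into one quality metric.
import re
from statistics import mean
from openai import OpenAI

client = OpenAI()
CRITERIA = ["faithfulness", "completeness", "clarity", "usefulness"]
JUDGES = ["gpt-4o", "gpt-4o-mini"]  # placeholder judges, not the paper's exact pair

def judge_score(judge, criterion, explanation, attributions_text):
    """Ask one judge model for a 1-5 rating of one criterion."""
    prompt = (f"Rate the following explanation for {criterion} on a scale from 1 to 5. "
              f"Reply with a single integer.\n\n"
              f"Attributions: {attributions_text}\n\nExplanation: {explanation}")
    reply = client.chat.completions.create(
        model=judge,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    ).choices[0].message.content
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None

def overall_quality(explanation, attributions_text):
    """Aggregate scores across both judges and all four criteria."""
    scores = [judge_score(j, c, explanation, attributions_text)
              for j in JUDGES for c in CRITERIA]
    return mean(s for s in scores if s is not None)
```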
Results & Findings
| Factor | Impact on NLE Quality |
|---|---|
| LLM choice | Dominant; DeepSeek‑R1 > GPT‑4o > Llama‑3‑8B |
| XAI method | Small boost over no‑XAI, but only noticeable for expert users |
| Forecasting model | SARIMAX (most accurate) produced the worst NLEs; ML models (XGB, RF, MLP) yielded richer stories |
| Prompting | Zero‑shot prompts performed on par with costly self‑consistency (≈7× cheaper); chain‑of‑thought reduced clarity and faithfulness |
| Audience | Non‑experts benefited little from SHAP/LIME; experts appreciated the marginal gains |
Overall, the study suggests that a powerful LLM can compensate for the absence of dedicated XAI attributions, while elaborate prompting may not be worth the extra compute budget.
Practical Implications
- For product teams building AI dashboards: Investing in a strong LLM (or a fine‑tuned variant) may be more cost‑effective than integrating multiple XAI libraries, especially when the target users are non‑technical.
- Prompt engineering budget: Simple zero‑shot prompts can deliver high‑quality explanations, freeing up compute resources for scaling or for other model inference tasks.
- Model selection trade‑offs: When explainability is a key requirement, choosing a black‑box ML model that works well with LLMs may be preferable to a statistically superior but less “explainable” model like SARIMAX.
- Developer tooling: SDKs that wrap SHAP/LIME outputs into a lightweight JSON payload for an LLM can be built once and reused across models, reducing engineering overhead; a sketch of such a wrapper follows this list.
- Cost optimization: Self‑consistency (multiple sampled answers) can be avoided without sacrificing quality, cutting inference costs by up to 85 %.
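As one concrete reading of the tooling point above, the following minimal sketch (the payload schema, field names, and the use of shap.Explainer are illustrative assumptions, not an SDK from the paper) wraps a single forecast's top SHAP attributions into a JSON payload that an LLM prompt template can consume:

```python
# Hypothetical sketch of a reusable SHAP-to-JSON wrapper; field names are illustrative.
import json
import shap

def shap_payload(model, X_row, model_name="xgboost", k=5):
    """Serialize one forecast's top-k SHAP attributions into a lightweight JSON string."""
    explainer = shap.Explainer(model, X_row)   # model-agnostic entry point
    explanation = explainer(X_row)
    values = explanation.values[0]             # attributions for this single row
    ranked = sorted(zip(X_row.columns, values), key=lambda p: abs(p[1]), reverse=True)[:k]
    return json.dumps({
        "model": model_name,
        "prediction": float(model.predict(X_row)[0]),
        "attributions": [{"feature": f, "shap_value": float(v)} for f, v in ranked],
    })
```

The same payload schema could be filled from LIME outputs instead, so the LLM prompt template stays identical across forecasting models and XAI methods.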
Limitations & Future Work
- Domain scope: The experiments focus on a single time‑series forecasting dataset; results may differ for classification, NLP, or computer‑vision tasks.
- LLM judge reliability: Using LLMs as evaluators introduces potential bias; human validation was not part of the study.
- Prompt diversity: Only eight prompting variants were tested; more nuanced prompt engineering (e.g., few‑shot examples, role‑playing) could reveal additional insights.
- Explainability depth: The study measures surface‑level NLE quality but does not assess downstream decision‑making impact or user trust over time.
Future research could extend the factorial design to other domains, incorporate human user studies, and explore fine‑tuning LLMs specifically for explanation generation.
Authors
- Fabian Lukassen
- Jan Herrmann
- Christoph Weisser
- Benjamin Saefken
- Thomas Kneib
Paper Information
- arXiv ID: 2601.02224v1
- Categories: cs.CL
- Published: January 5, 2026