[Paper] From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
Source: arXiv - 2601.02224v1
Overview
The paper investigates how to turn raw, numeric explanations from XAI tools (like SHAP and LIME) into human‑readable stories using Large Language Models (LLMs). By systematically varying the forecasting model, XAI method, LLM, and prompting style, the authors uncover which ingredients actually matter for producing high‑quality natural‑language explanations (NLEs) in a time‑series forecasting context.
Key Contributions
- Factorial experimental design covering 4 forecasting models, 3 XAI conditions (SHAP, LIME, and a no‑XAI baseline), 3 LLMs, and 8 prompting strategies, yielding 660 generated explanations in total.
- LLM‑as‑judge evaluation using G‑Eval with two independent LLM judges and four quality criteria (faithfulness, completeness, clarity, and usefulness).
- Empirical finding that the choice of LLM outweighs all other factors, with DeepSeek‑R1 consistently outperforming GPT‑4o and Llama‑3‑8B.
- Evidence that classic XAI methods add only marginal value for non‑expert users and can even be unnecessary when a strong LLM is used.
- Discovery of an “interpretability paradox”: a more accurate classical model (SARIMAX) yields poorer NLEs than black‑box ML models.
- Prompting insights: zero‑shot prompts match the quality of more expensive self‑consistency prompting, while chain‑of‑thought (CoT) degrades explanation quality.
Methodology
- Forecasting models – Four models were trained on a standard time‑series dataset:
  - XGBoost (XGB)
  - Random Forest (RF)
  - Multilayer Perceptron (MLP)
  - SARIMAX (a statistical time‑series model)
- XAI conditions – For each forecast, explanations were generated under one of three conditions:
  - SHAP
  - LIME
  - No‑XAI (raw prediction only)
- LLM generators – The numeric attributions (or raw predictions) were fed to three LLMs:
  - GPT‑4o (OpenAI)
  - Llama‑3‑8B (Meta)
  - DeepSeek‑R1 (DeepSeek)
- Prompting strategies – Eight variants, ranging from simple zero‑shot prompts to self‑consistency (aggregating multiple sampled answers) and chain‑of‑thought (CoT) prompts; a sketch of the generation step under these styles follows this list.
- Evaluation – Using G‑Eval, two LLM judges independently scored each explanation on four criteria:
  - Faithfulness (does it reflect the underlying attribution?)
  - Completeness (does it cover all important features?)
  - Clarity (is it readable for the target audience?)
  - Usefulness (is it actionable for the user?)
Scores were aggregated to produce an overall quality metric for each of the 660 explanations.
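The paper does not include code, so the sketch below is only a rough illustration of the generation step, assuming an XGBoost forecaster, the shap Python library, and an OpenAI‑compatible client. The prompt wording, helper names, and temperature settings are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: numeric SHAP attributions -> natural-language explanation.
# The library calls (shap.TreeExplainer, OpenAI chat completions) are real APIs;
# the prompt text and helper names are illustrative, not the paper's.
import shap
from openai import OpenAI

client = OpenAI()

def top_attributions(model, X_row, k=5):
    """Return the k features with the largest absolute SHAP value for one forecast."""
    explainer = shap.TreeExplainer(model)        # suitable for tree models such as XGBoost
    values = explainer.shap_values(X_row)[0]     # per-feature attribution for this row
    ranked = sorted(zip(X_row.columns, values), key=lambda p: abs(p[1]), reverse=True)
    return ranked[:k]

def build_prompt(prediction, attributions, style="zero_shot"):
    """Assemble a prompt from the forecast and its attributions."""
    facts = "; ".join(f"{name}: {val:+.3f}" for name, val in attributions)
    prompt = (f"The model forecasts {prediction:.2f}. "
              f"Feature attributions (SHAP): {facts}. "
              "Explain this forecast to a non-expert in plain language.")
    if style == "cot":
        prompt += " Think step by step before answering."
    return prompt

def explain(prediction, attributions, style="zero_shot", n_samples=1):
    """Zero-shot/CoT use one sample; self-consistency draws several samples."""
    prompt = build_prompt(prediction, attributions, style)
    replies = [
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7 if n_samples > 1 else 0.0,
        ).choices[0].message.content
        for _ in range(n_samples)
    ]
    return replies[0] if n_samples == 1 else replies  # aggregation of samples omitted
```

In this reading, self‑consistency simply draws several sampled answers per explanation, which is consistent with the roughly 7× cost difference reported in the results.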
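For the evaluation step, a comparable minimal sketch of LLM‑as‑judge scoring is given below. The judge model names, the 1–5 scale, and the criterion prompts are simplified placeholders; the paper follows the G‑Eval protocol with its own rubric and judge pair.

```python
# Hypothetical sketch of LLM-as-judge scoring: two judges rate each explanation
# on four criteria, and the scores are averaged into one quality metric.
import re
from statistics import mean
from openai import OpenAI

client = OpenAI()
CRITERIA = ["faithfulness", "completeness", "clarity", "usefulness"]
JUDGES = ["gpt-4o", "gpt-4o-mini"]  # placeholder judges, not the paper's exact pair

def judge_score(judge, criterion, explanation, attributions_text):
    """Ask one judge model for a 1-5 rating of one criterion."""
    prompt = (f"Rate the following explanation for {criterion} on a scale from 1 to 5. "
              f"Reply with a single integer.\n\n"
              f"Attributions: {attributions_text}\n\nExplanation: {explanation}")
    reply = client.chat.completions.create(
        model=judge,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    ).choices[0].message.content
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None

def overall_quality(explanation, attributions_text):
    """Aggregate scores across both judges and all four criteria."""
    scores = [judge_score(j, c, explanation, attributions_text)
              for j in JUDGES for c in CRITERIA]
    return mean(s for s in scores if s is not None)
```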
Results & Findings
| Factor | Impact on NLE Quality |
|---|---|
| LLM choice | Dominant; DeepSeek‑R1 > GPT‑4o > Llama‑3‑8B |
| XAI method | Small boost over no‑XAI, but only noticeable for expert users |
| Forecasting model | SARIMAX (most accurate) produced the worst NLEs; ML models (XGB, RF, MLP) yielded richer stories |
| Prompting | Zero‑shot prompts performed on par with costly self‑consistency (≈7× cheaper); chain‑of‑thought reduced clarity and faithfulness |
| Audience | Non‑experts benefited little from SHAP/LIME; experts appreciated the marginal gains |
Overall, the study suggests that a powerful LLM can compensate for the absence of dedicated XAI attributions, while elaborate prompting may not be worth the extra compute budget.
Practical Implications
- For product teams building AI dashboards: Investing in a strong LLM (or a fine‑tuned variant) may be more cost‑effective than integrating multiple XAI libraries, especially when the target users are non‑technical.
- Prompt engineering budget: Simple zero‑shot prompts can deliver high‑quality explanations, freeing up compute resources for scaling or for other model inference tasks.
- Model selection trade‑offs: When explainability is a key requirement, choosing a black‑box ML model that works well with LLMs may be preferable to a statistically superior but less “explainable” model like SARIMAX.
- Developer tooling: SDKs that wrap SHAP/LIME outputs into a lightweight JSON payload for an LLM can be built once and reused across models, reducing engineering overhead; a sketch of such a wrapper follows this list.
- Cost optimization: Self‑consistency (multiple sampled answers) can be avoided without sacrificing quality, cutting inference costs by up to 85 %.
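As one concrete reading of the tooling point above, the following minimal sketch (the payload schema, field names, and the use of shap.Explainer are illustrative assumptions, not an SDK from the paper) wraps a single forecast's top SHAP attributions into a JSON payload that an LLM prompt template can consume:

```python
# Hypothetical sketch of a reusable SHAP-to-JSON wrapper; field names are illustrative.
import json
import shap

def shap_payload(model, X_row, model_name="xgboost", k=5):
    """Serialize one forecast's top-k SHAP attributions into a lightweight JSON string."""
    explainer = shap.Explainer(model, X_row)   # model-agnostic entry point
    explanation = explainer(X_row)
    values = explanation.values[0]             # attributions for this single row
    ranked = sorted(zip(X_row.columns, values), key=lambda p: abs(p[1]), reverse=True)[:k]
    return json.dumps({
        "model": model_name,
        "prediction": float(model.predict(X_row)[0]),
        "attributions": [{"feature": f, "shap_value": float(v)} for f, v in ranked],
    })
```

The same payload schema could be filled from LIME outputs instead, so the LLM prompt template stays identical across forecasting models and XAI methods.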
Limitations & Future Work
- Domain scope: The experiments focus on a single time‑series forecasting dataset; results may differ for classification, NLP, or computer‑vision tasks.
- LLM judge reliability: Using LLMs as evaluators introduces potential bias; human validation was not part of the study.
- Prompt diversity: Only eight prompting variants were tested; more nuanced prompt engineering (e.g., few‑shot examples, role‑playing) could reveal additional insights.
- Explainability depth: The study measures surface‑level NLE quality but does not assess downstream decision‑making impact or user trust over time.
Future research could extend the factorial design to other domains, incorporate human user studies, and explore fine‑tuning LLMs specifically for explanation generation.
Authors
- Fabian Lukassen
- Jan Herrmann
- Christoph Weisser
- Benjamin Saefken
- Thomas Kneib
Paper Information
- arXiv ID: 2601.02224v1
- Categories: cs.CL
- Published: January 5, 2026