[Paper] Toward Explaining Large Language Models in Software Engineering Tasks
Source: arXiv - 2512.20328v1
Overview
The paper presents FeatureSHAP, a model‑agnostic framework that explains the decisions of large language models (LLMs) used for software‑engineering (SE) tasks such as code generation and code summarization. By adapting Shapley‑value based attribution to the peculiarities of source code and natural‑language documentation, the authors aim to make these “black‑box” models more transparent and trustworthy for developers working in safety‑critical or high‑impact environments.
Key Contributions
- FeatureSHAP framework – the first fully automated, domain‑specific explainability method for SE‑oriented LLMs.
- Task‑aware feature definition – maps raw tokens to high‑level SE concepts (e.g., API calls, control‑flow constructs, doc‑string sections) before computing Shapley values (see the sketch after this list).
- Model‑agnostic design – works with any LLM, whether open‑source (e.g., LLaMA, CodeBERT) or proprietary (e.g., OpenAI Codex).
- Empirical evaluation – on code‑generation and code‑summarization benchmarks, FeatureSHAP shows higher fidelity and less attribution to irrelevant inputs than generic SHAP and attention‑based baselines.
- Human‑centered validation – a survey of 37 software practitioners shows that FeatureSHAP explanations improve confidence and decision‑making when reviewing model outputs.
- Open‑source release – the implementation is publicly available, encouraging reproducibility and community extensions.
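As a concrete illustration of the task‑aware feature definition, the sketch below groups a Python prompt into coarse feature groups (imports, function signature, doc‑string, body) with the standard `ast` module. The group names and grouping rules are simplifying assumptions made for this sketch; the paper's taxonomy (e.g., API calls and control‑flow constructs) is finer‑grained and is not reproduced here.

```python
import ast  # requires Python 3.9+ for ast.unparse

def group_prompt_features(prompt_code: str) -> dict:
    """Split a Python prompt into coarse SE feature groups (illustrative taxonomy only)."""
    tree = ast.parse(prompt_code)
    groups = {"imports": [], "signature": [], "docstring": [], "body": []}

    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            groups["imports"].append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The first line of the unparsed function is its def line (signature).
            groups["signature"].append(ast.unparse(node).splitlines()[0])
            docstring = ast.get_docstring(node)
            if docstring:
                groups["docstring"].append(docstring)
            # Body statements, minus the doc-string expression if one is present.
            body = node.body[1:] if docstring else node.body
            groups["body"].extend(ast.unparse(stmt) for stmt in body)
        else:
            groups["body"].append(ast.unparse(node))

    return {name: "\n".join(parts) for name, parts in groups.items() if parts}

if __name__ == "__main__":
    prompt = (
        "import math\n"
        "def area(radius: float) -> float:\n"
        '    """Return the area of a circle with the given radius."""\n'
        "    return math.pi * radius ** 2\n"
    )
    for name, text in group_prompt_features(prompt).items():
        print(f"[{name}] {text}")
```

Each resulting group (rather than each token) then becomes one attribution unit, which is what keeps the explanations at the level of SE concepts a developer can act on.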
Methodology
- Feature Engineering for SE – Input prompts are parsed into semantically meaningful groups (e.g., function signature, surrounding comments, imported libraries). Each group becomes a “feature” for attribution.
- Perturbation & Similarity – For a given feature, FeatureSHAP creates perturbed versions of the prompt by masking or replacing that feature while keeping the rest unchanged. The resulting model outputs are compared to the original using a task‑specific similarity metric (BLEU for generation, ROUGE for summarization).
- Shapley Value Approximation – Using Monte‑Carlo sampling, the framework estimates each feature’s contribution to the final output score, yielding normalized importance scores that sum to 1 (a minimal estimation sketch follows this list).
- Explanation Rendering – The scores are visualized alongside the original prompt, highlighting which parts of the code or comment most influenced the LLM’s answer.
- Evaluation Pipeline – The authors benchmark FeatureSHAP against baseline SHAP (token‑level) and attention‑weight visualizations across two datasets: (a) Python function generation from doc‑strings, and (b) code summarization from source snippets. Fidelity is measured by how well the attribution aligns with controlled “ground‑truth” feature importance (synthetically injected).
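To make the attribution step concrete: the classical Shapley value of a feature group i over the full group set F is φ_i = Σ_{S ⊆ F∖{i}} |S|!·(|F|−|S|−1)!/|F|! · [v(S ∪ {i}) − v(S)], where, following the description above, v(S) is the similarity between the original output and the output produced when only the groups in S are left unmasked. The sketch below estimates these values by permutation sampling. The `run_model` interface, the mask‑token perturbation, the token‑overlap similarity (a self‑contained stand‑in for the BLEU/ROUGE metrics the paper uses), and the final normalization are illustrative assumptions rather than the paper's implementation.

```python
import random
from typing import Callable, Dict, Set

def token_overlap_similarity(reference: str, candidate: str) -> float:
    """Unigram-overlap F1: a self-contained stand-in for the paper's BLEU/ROUGE metrics."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    overlap = len(set(ref_tokens) & set(cand_tokens))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def shapley_estimates(
    features: Dict[str, str],                 # feature-group name -> prompt fragment
    run_model: Callable[[str], str],          # prompt text -> LLM output (assumed interface)
    similarity: Callable[[str, str], float] = token_overlap_similarity,
    n_permutations: int = 200,
    mask_token: str = "<MASKED>",
) -> Dict[str, float]:
    """Monte Carlo (permutation-sampling) estimate of per-feature-group Shapley values.

    A group's contribution in one permutation is the change in output similarity
    when that group is revealed on top of the groups revealed before it.
    """
    names = list(features)
    reference_output = run_model("\n".join(features[n] for n in names))

    def coalition_value(revealed: Set[str]) -> float:
        # Hide every feature group outside the coalition behind a mask token.
        prompt = "\n".join(features[n] if n in revealed else mask_token for n in names)
        return similarity(reference_output, run_model(prompt))

    base_value = coalition_value(set())
    totals = {n: 0.0 for n in names}
    for _ in range(n_permutations):
        order = random.sample(names, len(names))
        revealed: Set[str] = set()
        previous = base_value
        for name in order:
            revealed.add(name)
            current = coalition_value(revealed)
            totals[name] += current - previous
            previous = current

    estimates = {n: total / n_permutations for n, total in totals.items()}
    norm = sum(abs(v) for v in estimates.values()) or 1.0  # one possible normalization
    return {n: v / norm for n, v in estimates.items()}
```

Every coalition evaluation re‑queries the model, which is where the inference‑time overhead reported under Results comes from; the returned scores can then be rendered next to the corresponding prompt regions, as in the Explanation Rendering step.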
Results & Findings
- Higher Fidelity – FeatureSHAP’s attributions achieve correlations of 0.78 (code generation) and 0.74 (summarization) with the synthetic ground truth, outperforming token‑level SHAP (0.61 / 0.58) and attention baselines (≈0.45).
- Reduced Noise – Irrelevant features (e.g., unrelated import statements) receive near‑zero importance, whereas baseline methods often assign spurious weight.
- Human Study – 84 % of surveyed developers reported that FeatureSHAP explanations helped them spot erroneous model suggestions faster, and 71 % said they would be more willing to adopt LLM‑based tooling in regulated projects.
- Performance Overhead – The perturbation‑based approach adds ~2× inference time per explanation, which the authors deem acceptable for debugging or code‑review scenarios.
Practical Implications
- Debugging LLM‑Generated Code – Developers can quickly identify which parts of a prompt (e.g., a missing type hint) caused a faulty suggestion, enabling targeted prompt engineering.
- Compliance & Auditing – In domains like automotive or medical software, FeatureSHAP provides traceable evidence of why a model produced a particular implementation, supporting regulatory documentation.
- Tool Integration – The framework can be wrapped as a VS Code extension or CI‑pipeline plugin, surfacing explanations alongside AI‑assisted suggestions without requiring model retraining.
- Cross‑Model Portability – Teams using proprietary APIs (e.g., OpenAI) can still obtain explanations without exposing model internals, preserving IP while gaining transparency.
Limitations & Future Work
- Scalability – The Monte‑Carlo sampling required for Shapley estimation can become costly for very large prompts or multi‑file contexts.
- Feature Granularity – The current feature taxonomy is handcrafted for Python; extending to other languages or mixed‑language projects will need additional engineering.
- Dynamic Code – Runtime behavior (e.g., side effects, performance) is not captured; future work could combine static attributions with dynamic profiling.
- User Interaction – The study focused on static surveys; longitudinal studies in real development workflows are needed to quantify productivity gains.
FeatureSHAP marks a concrete step toward making LLM‑driven software engineering tools not just powerful, but also understandable and trustworthy for everyday developers.
Authors
- Antonio Vitale
- Khai‑Nguyen Nguyen
- Denys Poshyvanyk
- Rocco Oliveto
- Simone Scalabrino
- Antonio Mastropaolo
Paper Information
- arXiv ID: 2512.20328v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: December 23, 2025