[Paper] Toward Explaining Large Language Models in Software Engineering Tasks
Source: arXiv - 2512.20328v1
Overview
The paper presents FeatureSHAP, a model‑agnostic framework that explains the decisions of large language models (LLMs) used for software‑engineering (SE) tasks such as code generation and code summarization. By adapting Shapley‑value based attribution to the peculiarities of source code and natural‑language documentation, the authors aim to make these “black‑box” models more transparent and trustworthy for developers working in safety‑critical or high‑impact environments.
Key Contributions
- FeatureSHAP framework – the first fully automated, domain‑specific explainability method for SE‑oriented LLMs.
- Task‑aware feature definition – maps raw tokens to high‑level SE concepts (e.g., API calls, control‑flow constructs, doc‑string sections) before computing Shapley values (see the sketch after this list).
- Model‑agnostic design – works with any LLM, whether open‑source (e.g., LLaMA, CodeBERT) or proprietary (e.g., OpenAI Codex).
- Empirical evaluation – on code‑generation and code‑summarization benchmarks, FeatureSHAP shows higher fidelity and less attribution to irrelevant inputs than generic SHAP and attention‑based baselines.
- Human‑centered validation – a survey of 37 software practitioners shows that FeatureSHAP explanations improve confidence and decision‑making when reviewing model outputs.
- Open‑source release – the implementation is publicly available, encouraging reproducibility and community extensions.
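As a concrete illustration of the task‑aware feature definition, the sketch below groups a Python prompt into coarse feature groups (imports, function signature, doc‑string, body) with the standard `ast` module. The group names and grouping rules are simplifying assumptions made for this sketch; the paper's taxonomy (e.g., API calls and control‑flow constructs) is finer‑grained and is not reproduced here.

```python
import ast  # requires Python 3.9+ for ast.unparse

def group_prompt_features(prompt_code: str) -> dict:
    """Split a Python prompt into coarse SE feature groups (illustrative taxonomy only)."""
    tree = ast.parse(prompt_code)
    groups = {"imports": [], "signature": [], "docstring": [], "body": []}

    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            groups["imports"].append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The first line of the unparsed function is its def line (signature).
            groups["signature"].append(ast.unparse(node).splitlines()[0])
            docstring = ast.get_docstring(node)
            if docstring:
                groups["docstring"].append(docstring)
            # Body statements, minus the doc-string expression if one is present.
            body = node.body[1:] if docstring else node.body
            groups["body"].extend(ast.unparse(stmt) for stmt in body)
        else:
            groups["body"].append(ast.unparse(node))

    return {name: "\n".join(parts) for name, parts in groups.items() if parts}

if __name__ == "__main__":
    prompt = (
        "import math\n"
        "def area(radius: float) -> float:\n"
        '    """Return the area of a circle with the given radius."""\n'
        "    return math.pi * radius ** 2\n"
    )
    for name, text in group_prompt_features(prompt).items():
        print(f"[{name}] {text}")
```

Each resulting group (rather than each token) then becomes one attribution unit, which is what keeps the explanations at the level of SE concepts a developer can act on.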
Methodology
- Feature Engineering for SE – Input prompts are parsed into semantically meaningful groups (e.g., function signature, surrounding comments, imported libraries). Each group becomes a “feature” for attribution.
- Perturbation & Similarity – For a given feature, FeatureSHAP creates perturbed versions of the prompt by masking or replacing that feature while keeping the rest unchanged. The resulting model outputs are compared to the original using a task‑specific similarity metric (BLEU for generation, ROUGE for summarization).
- Shapley Value Approximation – Using Monte‑Carlo sampling, the framework estimates each feature’s contribution to the final output score, yielding normalized importance scores that sum to 1 (a minimal estimation sketch follows this list).
- Explanation Rendering – The scores are visualized alongside the original prompt, highlighting which parts of the code or comment most influenced the LLM’s answer.
- Evaluation Pipeline – The authors benchmark FeatureSHAP against baseline SHAP (token‑level) and attention‑weight visualizations across two datasets: (a) Python function generation from doc‑strings, and (b) code summarization from source snippets. Fidelity is measured by how well the attribution aligns with controlled “ground‑truth” feature importance (synthetically injected).
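To make the attribution step concrete: the classical Shapley value of a feature group i over the full group set F is φ_i = Σ_{S ⊆ F∖{i}} |S|!·(|F|−|S|−1)!/|F|! · [v(S ∪ {i}) − v(S)], where, following the description above, v(S) is the similarity between the original output and the output produced when only the groups in S are left unmasked. The sketch below estimates these values by permutation sampling. The `run_model` interface, the mask‑token perturbation, the token‑overlap similarity (a self‑contained stand‑in for the BLEU/ROUGE metrics the paper uses), and the final normalization are illustrative assumptions rather than the paper's implementation.

```python
import random
from typing import Callable, Dict, Set

def token_overlap_similarity(reference: str, candidate: str) -> float:
    """Unigram-overlap F1: a self-contained stand-in for the paper's BLEU/ROUGE metrics."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    overlap = len(set(ref_tokens) & set(cand_tokens))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def shapley_estimates(
    features: Dict[str, str],                 # feature-group name -> prompt fragment
    run_model: Callable[[str], str],          # prompt text -> LLM output (assumed interface)
    similarity: Callable[[str, str], float] = token_overlap_similarity,
    n_permutations: int = 200,
    mask_token: str = "<MASKED>",
) -> Dict[str, float]:
    """Monte Carlo (permutation-sampling) estimate of per-feature-group Shapley values.

    A group's contribution in one permutation is the change in output similarity
    when that group is revealed on top of the groups revealed before it.
    """
    names = list(features)
    reference_output = run_model("\n".join(features[n] for n in names))

    def coalition_value(revealed: Set[str]) -> float:
        # Hide every feature group outside the coalition behind a mask token.
        prompt = "\n".join(features[n] if n in revealed else mask_token for n in names)
        return similarity(reference_output, run_model(prompt))

    base_value = coalition_value(set())
    totals = {n: 0.0 for n in names}
    for _ in range(n_permutations):
        order = random.sample(names, len(names))
        revealed: Set[str] = set()
        previous = base_value
        for name in order:
            revealed.add(name)
            current = coalition_value(revealed)
            totals[name] += current - previous
            previous = current

    estimates = {n: total / n_permutations for n, total in totals.items()}
    norm = sum(abs(v) for v in estimates.values()) or 1.0  # one possible normalization
    return {n: v / norm for n, v in estimates.items()}
```

Every coalition evaluation re‑queries the model, which is where the inference‑time overhead reported under Results comes from; the returned scores can then be rendered next to the corresponding prompt regions, as in the Explanation Rendering step.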
Results & Findings
- Higher Fidelity – FeatureSHAP’s attributions achieve correlations of 0.78 (code generation) and 0.74 (summarization) with the synthetic ground truth, outperforming token‑level SHAP (0.61 / 0.58) and attention baselines (≈0.45).
- Reduced Noise – Irrelevant features (e.g., unrelated import statements) receive near‑zero importance, whereas baseline methods often assign spurious weight.
- Human Study – 84 % of surveyed developers reported that FeatureSHAP explanations helped them spot erroneous model suggestions faster, and 71 % said they would be more willing to adopt LLM‑based tooling in regulated projects.
- Performance Overhead – The perturbation‑based approach adds ~2× inference time per explanation, which the authors deem acceptable for debugging or code‑review scenarios.
Practical Implications
- Debugging LLM‑Generated Code – Developers can quickly identify which parts of a prompt (e.g., a missing type hint) caused a faulty suggestion, enabling targeted prompt engineering.
- Compliance & Auditing – In domains like automotive or medical software, FeatureSHAP provides traceable evidence of why a model produced a particular implementation, supporting regulatory documentation.
- Tool Integration – The framework can be wrapped as a VS Code extension or CI‑pipeline plugin, surfacing explanations alongside AI‑assisted suggestions without requiring model retraining.
- Cross‑Model Portability – Teams using proprietary APIs (e.g., OpenAI) can still obtain explanations without exposing model internals, preserving IP while gaining transparency.
Limitations & Future Work
- Scalability – The Monte‑Carlo sampling required for Shapley estimation can become costly for very large prompts or multi‑file contexts.
- Feature Granularity – The current feature taxonomy is handcrafted for Python; extending to other languages or mixed‑language projects will need additional engineering.
- Dynamic Code – Runtime behavior (e.g., side effects, performance) is not captured; future work could combine static attributions with dynamic profiling.
- User Interaction – The study focused on static surveys; longitudinal studies in real development workflows are needed to quantify productivity gains.
FeatureSHAP marks a concrete step toward making LLM‑driven software engineering tools not just powerful, but also understandable and trustworthy for everyday developers.
Authors
- Antonio Vitale
- Khai‑Nguyen Nguyen
- Denys Poshyvanyk
- Rocco Oliveto
- Simone Scalabrino
- Antonio Mastropaolo
Paper Information
- arXiv ID: 2512.20328v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: December 23, 2025