[Paper] Detecting UX smells in Visual Studio Code using LLMs

Published: February 25, 2026 at 10:32 AM EST
4 min read
Source: arXiv


Overview

The paper introduces a novel, LLM‑driven pipeline for surfacing “UX smells”—recurrent usability problems—in Visual Studio Code (VS Code). By mining thousands of GitHub‑hosted issue reports and classifying them against a validated UX taxonomy, the authors reveal where the editor’s user experience falls short, offering a data‑backed roadmap for improvement.

Key Contributions

  • LLM‑assisted mining pipeline: Combines large language models with keyword filtering to automatically extract candidate UX‑related issues from VS Code’s public GitHub repository.
  • Validated UX smell taxonomy: Applies a previously vetted classification scheme (informativeness, clarity, intuitiveness, efficiency, etc.) to the extracted issues, ensuring consistent labeling.
  • Empirical dataset: Publishes a curated set of 1,200+ labeled UX‑smell instances, complete with issue URLs and model confidence scores, for reproducibility.
  • Insightful distribution analysis: Shows that over 70 % of identified smells cluster in four categories (informativeness, clarity, intuitiveness, and efficiency), pinpointing where developers experience the most friction.
  • Expert verification loop: Involves UX researchers and seasoned VS Code contributors to validate a random sample of model predictions, achieving a 0.84 Cohen’s κ agreement.
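
The Cohen's κ agreement reported above corrects raw agreement for chance. A minimal sketch of the computation, using hypothetical labels (not the paper's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    labeling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative labels only; the paper's 0.84 comes from its own sample.
model  = ["clarity", "efficiency", "clarity", "intuitiveness"]
expert = ["clarity", "efficiency", "informativeness", "intuitiveness"]
kappa = cohens_kappa(model, expert)
```

Here three of four labels match (observed 0.75) against an expected chance agreement of 0.25, giving κ ≈ 0.67; values above 0.8, like the paper's 0.84, are conventionally read as strong agreement.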

Methodology

  1. Data Collection – The authors scraped all open and closed issues from the official VS Code GitHub repo (≈ 30 k entries).
  2. Pre‑filtering – Simple lexical cues (e.g., “confusing”, “slow”, “hard to find”) reduced the set to ~4 k potentially UX‑related tickets.
  3. LLM Classification – A fine‑tuned GPT‑3.5 model was prompted with the taxonomy definitions and asked to assign one or more UX smell labels to each issue's description and comments.
  4. Human Review – A panel of three UX experts independently reviewed a stratified 10 % sample, resolving disagreements and providing feedback to refine the prompt and post‑processing rules.
  5. Aggregation & Analysis – Final labels were aggregated, and frequency, co‑occurrence, and temporal trends were visualized to surface hot‑spots in the editor’s UX.
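
Steps 2 and 3 can be sketched as a keyword pre-filter feeding a prompt builder. The cue list, taxonomy tuple, and prompt wording below are illustrative assumptions; the paper's actual keywords and prompt are not reproduced in this summary:

```python
import re

# Hypothetical lexical cues for the pre-filter (step 2).
UX_CUES = re.compile(r"\b(confusing|slow|hard to find|unclear|laggy)\b", re.I)
TAXONOMY = ("informativeness", "clarity", "intuitiveness", "efficiency")

def prefilter(issues):
    """Step 2: keep only issues whose title/body matches a lexical UX cue."""
    return [i for i in issues if UX_CUES.search(i["title"] + " " + i["body"])]

def build_prompt(issue):
    """Step 3: combine taxonomy labels with the issue text so the LLM
    can assign one or more UX smell labels."""
    return (
        "Assign one or more UX smell labels from {"
        + ", ".join(TAXONOMY)
        + "} to this VS Code issue.\n\n"
        + f"Title: {issue['title']}\nBody: {issue['body']}"
    )

issues = [
    {"title": "Command palette is hard to find", "body": "New users miss it."},
    {"title": "Add dark theme variant", "body": "Feature request."},
]
candidates = prefilter(issues)  # only the first issue matches a cue
```

The cheap lexical pass mirrors the paper's ~30 k → ~4 k reduction: it trades some recall for a much smaller set that the LLM must classify.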

Results & Findings

  • Distribution: Informativeness (28 %), clarity (22 %), intuitiveness (15 %), and efficiency (12 %) together account for 77 % of all detected smells.
  • Severity: Issues tagged as “efficiency” often correspond to performance bottlenecks (e.g., laggy extensions), while “clarity” problems frequently involve ambiguous UI icons or tooltips.
  • Temporal trend: A noticeable dip in new “clarity” smells after the release of VS Code 1.80 suggests that targeted UI redesigns can quickly reduce certain UX problems.
  • Model performance: On the expert‑validated sample, the LLM achieved 81 % precision and 78 % recall across the taxonomy, outperforming a baseline keyword‑only classifier (62 % / 55 %).
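
Because each issue can carry multiple labels, precision and recall are naturally computed over label sets. The sketch below assumes micro-averaging (pooling true/false positives across all issues); the summary does not state which averaging scheme the authors used, and the labels are invented:

```python
def micro_pr(gold, pred):
    """Micro-averaged precision and recall for multi-label predictions.
    gold, pred: parallel lists of label sets, one per issue."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical expert labels vs. model predictions for three issues:
gold = [{"clarity"}, {"efficiency", "clarity"}, {"intuitiveness"}]
pred = [{"clarity", "efficiency"}, {"efficiency"}, {"intuitiveness"}]
precision, recall = micro_pr(gold, pred)
```

With 3 correct labels, 1 spurious, and 1 missed, both metrics come out at 0.75, close to the 81 % / 78 % range the paper reports on its validated sample.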

Practical Implications

  • Prioritized bug triage: Development teams can automatically flag incoming GitHub issues that likely contain UX smells, routing them to UI/UX designers before they become larger pain points.
  • Data‑driven UI redesign: The concentration of smells in specific categories gives product managers concrete targets (e.g., improve tooltip wording, streamline command palette) for the next release cycle.
  • Extension ecosystem health: By surfacing efficiency‑related smells, maintainers can identify extensions that degrade performance and provide guidance for optimization.
  • Continuous monitoring: The pipeline can be set up as a CI‑style watchdog, periodically re‑scanning the issue tracker to detect emerging UX regressions after new feature rollouts.
  • Cross‑tool applicability: The same LLM‑assisted approach can be adapted to other IDEs (IntelliJ, Eclipse) or developer‑focused platforms (Docker Desktop, Postman), enabling a broader “UX health dashboard” for the tooling ecosystem.
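
The CI-style watchdog above amounts to periodically querying the issue tracker for recent activity and re-running the classifier. A minimal sketch of the query step, using the public GitHub REST Issues endpoint (its `since` and `state` parameters are documented API features; the repo slug and lookback window here are just examples):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def recent_issues_url(owner, repo, lookback_days=7):
    """Build a GitHub REST API query for issues updated in the last
    `lookback_days` days, suitable for a scheduled re-scan job."""
    since = datetime.now(timezone.utc) - timedelta(days=lookback_days)
    params = {
        "since": since.strftime("%Y-%m-%dT%H:%M:%SZ"),  # ISO 8601, as the API expects
        "state": "all",      # include closed issues too
        "per_page": 100,     # maximum page size
    }
    return f"https://api.github.com/repos/{owner}/{repo}/issues?" + urlencode(params)

url = recent_issues_url("microsoft", "vscode")
```

A scheduled job (e.g. a cron-triggered GitHub Actions workflow) would fetch this URL page by page, run the pre-filter and LLM classifier on new or updated issues, and flag fresh UX smells after each release.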

Limitations & Future Work

  • Language bias: The model was trained on English‑only issue text, potentially missing UX smells reported in other languages.
  • Context loss: Short issue titles sometimes lack sufficient context for accurate classification, leading to false negatives.
  • Taxonomy scope: The chosen taxonomy, while validated, may omit emerging UX dimensions such as accessibility or collaborative editing.
  • Future directions: The authors plan to incorporate multimodal data (screenshots, video demos) and to experiment with newer instruction‑tuned LLMs (e.g., GPT‑4) to boost classification fidelity. They also aim to open‑source the entire pipeline as a plug‑in for GitHub Actions, making it easier for other projects to adopt the methodology.

Authors

  • Andrés Rodriguez
  • Juan Cruz Gardey
  • Alejandra Garrido

Paper Information

  • arXiv ID: 2602.22020v1
  • Categories: cs.SE, cs.HC
  • Published: February 25, 2026