[Paper] Detecting UX smells in Visual Studio Code using LLMs

Published: February 25, 2026 at 10:32 AM EST
4 min read
Source: arXiv


Overview

The paper introduces a novel, LLM‑driven pipeline for surfacing “UX smells”—recurrent usability problems—in Visual Studio Code (VS Code). By mining thousands of GitHub‑hosted issue reports and classifying them against a validated UX taxonomy, the authors reveal where the editor’s user experience falls short, offering a data‑backed roadmap for improvement.

Key Contributions

  • LLM‑assisted mining pipeline: Combines large language models with keyword filtering to automatically extract candidate UX‑related issues from VS Code’s public GitHub repository.
  • Validated UX smell taxonomy: Applies a previously vetted classification scheme (informativeness, clarity, intuitiveness, efficiency, etc.) to the extracted issues, ensuring consistent labeling.
  • Empirical dataset: Publishes a curated set of 1,200+ labeled UX‑smell instances, complete with issue URLs and model confidence scores, for reproducibility.
  • Insightful distribution analysis: Shows that over 70 % of identified smells cluster in four categories (informativeness, clarity, intuitiveness, and efficiency), pinpointing where developers experience the most friction.
  • Expert verification loop: Involves UX researchers and seasoned VS Code contributors to validate a random sample of model predictions, achieving a 0.84 Cohen’s κ agreement.
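
The Cohen's κ agreement reported above corrects raw agreement for chance. A minimal sketch of the computation, using hypothetical labels (not the paper's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    labeling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative labels only; the paper's 0.84 comes from its own sample.
model  = ["clarity", "efficiency", "clarity", "intuitiveness"]
expert = ["clarity", "efficiency", "informativeness", "intuitiveness"]
kappa = cohens_kappa(model, expert)
```

Here three of four labels match (observed 0.75) against an expected chance agreement of 0.25, giving κ ≈ 0.67; values above 0.8, like the paper's 0.84, are conventionally read as strong agreement.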

Methodology

  1. Data Collection – The authors scraped all open and closed issues from the official VS Code GitHub repo (≈ 30 k entries).
  2. Pre‑filtering – Simple lexical cues (e.g., “confusing”, “slow”, “hard to find”) reduced the set to ~4 k potentially UX‑related tickets.
  3. LLM Classification – A fine‑tuned GPT‑3.5 model was prompted with the taxonomy definitions and asked to assign one or more UX smell labels to each issue's description and comments.
  4. Human Review – A panel of three UX experts independently reviewed a stratified 10 % sample, resolving disagreements and providing feedback to refine the prompt and post‑processing rules.
  5. Aggregation & Analysis – Final labels were aggregated, and frequency, co‑occurrence, and temporal trends were visualized to surface hot‑spots in the editor’s UX.
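
Steps 2 and 3 can be sketched as a keyword pre-filter feeding a prompt builder. The cue list, taxonomy tuple, and prompt wording below are illustrative assumptions; the paper's actual keywords and prompt are not reproduced in this summary:

```python
import re

# Hypothetical lexical cues for the pre-filter (step 2).
UX_CUES = re.compile(r"\b(confusing|slow|hard to find|unclear|laggy)\b", re.I)
TAXONOMY = ("informativeness", "clarity", "intuitiveness", "efficiency")

def prefilter(issues):
    """Step 2: keep only issues whose title/body matches a lexical UX cue."""
    return [i for i in issues if UX_CUES.search(i["title"] + " " + i["body"])]

def build_prompt(issue):
    """Step 3: combine taxonomy labels with the issue text so the LLM
    can assign one or more UX smell labels."""
    return (
        "Assign one or more UX smell labels from {"
        + ", ".join(TAXONOMY)
        + "} to this VS Code issue.\n\n"
        + f"Title: {issue['title']}\nBody: {issue['body']}"
    )

issues = [
    {"title": "Command palette is hard to find", "body": "New users miss it."},
    {"title": "Add dark theme variant", "body": "Feature request."},
]
candidates = prefilter(issues)  # only the first issue matches a cue
```

The cheap lexical pass mirrors the paper's ~30 k → ~4 k reduction: it trades some recall for a much smaller set that the LLM must classify.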

Results & Findings

  • Distribution: Informativeness (28 %), clarity (22 %), intuitiveness (15 %), and efficiency (12 %) together account for 77 % of all detected smells.
  • Severity: Issues tagged as “efficiency” often correspond to performance bottlenecks (e.g., laggy extensions), while “clarity” problems frequently involve ambiguous UI icons or tooltips.
  • Temporal trend: A noticeable dip in new “clarity” smells after the release of VS Code 1.80 suggests that targeted UI redesigns can quickly reduce certain UX problems.
  • Model performance: On the expert‑validated sample, the LLM achieved 81 % precision and 78 % recall across the taxonomy, outperforming a baseline keyword‑only classifier (62 % / 55 %).
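
Because each issue can carry multiple labels, precision and recall are naturally computed over label sets. The sketch below assumes micro-averaging (pooling true/false positives across all issues); the summary does not state which averaging scheme the authors used, and the labels are invented:

```python
def micro_pr(gold, pred):
    """Micro-averaged precision and recall for multi-label predictions.
    gold, pred: parallel lists of label sets, one per issue."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical expert labels vs. model predictions for three issues:
gold = [{"clarity"}, {"efficiency", "clarity"}, {"intuitiveness"}]
pred = [{"clarity", "efficiency"}, {"efficiency"}, {"intuitiveness"}]
precision, recall = micro_pr(gold, pred)
```

With 3 correct labels, 1 spurious, and 1 missed, both metrics come out at 0.75, close to the 81 % / 78 % range the paper reports on its validated sample.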

Practical Implications

  • Prioritized bug triage: Development teams can automatically flag incoming GitHub issues that likely contain UX smells, routing them to UI/UX designers before they become larger pain points.
  • Data‑driven UI redesign: The concentration of smells in specific categories gives product managers concrete targets (e.g., improve tooltip wording, streamline command palette) for the next release cycle.
  • Extension ecosystem health: By surfacing efficiency‑related smells, maintainers can identify extensions that degrade performance and provide guidance for optimization.
  • Continuous monitoring: The pipeline can be set up as a CI‑style watchdog, periodically re‑scanning the issue tracker to detect emerging UX regressions after new feature rollouts.
  • Cross‑tool applicability: The same LLM‑assisted approach can be adapted to other IDEs (IntelliJ, Eclipse) or developer‑focused platforms (Docker Desktop, Postman), enabling a broader “UX health dashboard” for the tooling ecosystem.
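
The CI-style watchdog above amounts to periodically querying the issue tracker for recent activity and re-running the classifier. A minimal sketch of the query step, using the public GitHub REST Issues endpoint (its `since` and `state` parameters are documented API features; the repo slug and lookback window here are just examples):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def recent_issues_url(owner, repo, lookback_days=7):
    """Build a GitHub REST API query for issues updated in the last
    `lookback_days` days, suitable for a scheduled re-scan job."""
    since = datetime.now(timezone.utc) - timedelta(days=lookback_days)
    params = {
        "since": since.strftime("%Y-%m-%dT%H:%M:%SZ"),  # ISO 8601, as the API expects
        "state": "all",      # include closed issues too
        "per_page": 100,     # maximum page size
    }
    return f"https://api.github.com/repos/{owner}/{repo}/issues?" + urlencode(params)

url = recent_issues_url("microsoft", "vscode")
```

A scheduled job (e.g. a cron-triggered GitHub Actions workflow) would fetch this URL page by page, run the pre-filter and LLM classifier on new or updated issues, and flag fresh UX smells after each release.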

Limitations & Future Work

  • Language bias: The model was trained on English‑only issue text, potentially missing UX smells reported in other languages.
  • Context loss: Short issue titles sometimes lack sufficient context for accurate classification, leading to false negatives.
  • Taxonomy scope: The chosen taxonomy, while validated, may omit emerging UX dimensions such as accessibility or collaborative editing.
  • Future directions: The authors plan to incorporate multimodal data (screenshots, video demos) and to experiment with newer instruction‑tuned LLMs (e.g., GPT‑4) to boost classification fidelity. They also aim to open‑source the entire pipeline as a plug‑in for GitHub Actions, making it easier for other projects to adopt the methodology.

Authors

  • Andrés Rodriguez
  • Juan Cruz Gardey
  • Alejandra Garrido

Paper Information

  • arXiv ID: 2602.22020v1
  • Categories: cs.SE, cs.HC
  • Published: February 25, 2026