[Paper] Patterns of Developer Adoption of LLM-Generated Code Refactoring Suggestions
Source: arXiv - 2605.04835v1
Overview
The paper investigates how software engineers actually use refactoring suggestions generated by large language models (LLMs) such as ChatGPT. By mining 169 real‑world GitHub commits that reference a ChatGPT conversation, the authors reveal the patterns that emerge when developers accept, modify, or reject AI‑driven refactorings. The study bridges the gap between academic evaluations of LLM output and the day‑to‑day workflow of developers.
Key Contributions
- Empirical dataset of 169 GitHub commits linking ChatGPT‑generated refactoring advice to concrete code changes.
- Adoption taxonomy that classifies developer responses into five distinct patterns (e.g., straight acceptance, major rewrites, prompt‑driven refinements).
- An analysis of how the nature of the original prompt, the validity of the LLM’s answer, and the type of modification developers perform relate to one another.
- Evidence that most suggestions (≈ 68 % of commits) are committed as‑is, which suggests developers place considerable trust in current LLMs for refactoring tasks.
- Guidelines for tool builders on how to surface, validate, and integrate LLM refactoring suggestions into IDEs and CI pipelines.
Methodology
- Data collection – The authors searched public GitHub repositories for commit messages that contain a URL to a ChatGPT conversation. Each match was manually verified to ensure the commit indeed applied a refactoring suggested by the model.
- Commit analysis – For every commit, the original code, the LLM’s suggestion, and the final code after the commit were compared.
- Pattern identification – Using qualitative coding, the researchers grouped the observed developer actions into five high‑level patterns, taking into account:
  - The refactoring activity (e.g., rename, extract method, simplify condition).
  - The prompt the developer gave to ChatGPT (clarity, specificity).
  - The validity of the model’s answer (correct, partially correct, or erroneous).
- Quantitative summary – Frequencies of each pattern were computed, and statistical checks were performed to see whether prompt quality or answer correctness significantly influenced the chosen pattern.
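The data-collection and commit-analysis steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the URL pattern for shared conversations, the function names `find_chatgpt_links` and `suggestion_similarity`, and the use of a line-based similarity ratio are all assumptions made for this example.

```python
import difflib
import re

# Assumed URL shapes for shared ChatGPT conversations; the exact search
# query used in the paper is not stated in this summary.
CHATGPT_LINK = re.compile(
    r"https://(?:chat\.openai\.com|chatgpt\.com)/share/[0-9a-fA-F-]+"
)


def find_chatgpt_links(commit_message: str) -> list[str]:
    """Return every shared-conversation URL mentioned in a commit message."""
    return CHATGPT_LINK.findall(commit_message)


def suggestion_similarity(suggested: str, committed: str) -> float:
    """Similarity in [0, 1] between suggested and committed code.

    Lines are compared after stripping surrounding whitespace, so a change
    in indentation alone does not count as a modification.
    """
    a = [line.strip() for line in suggested.strip().splitlines()]
    b = [line.strip() for line in committed.strip().splitlines()]
    return difflib.SequenceMatcher(None, a, b).ratio()
```

A ratio of 1.0 would point to a suggestion committed unchanged, while lower values flag commits for the kind of manual inspection the authors describe. Any cut-off between "minor" and "major" modification would be a judgment call that this sketch does not make.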
Results & Findings
- Straight acceptance dominates – ≈ 68 % of commits applied the LLM’s suggestion without any changes.
- When developers intervene, the modifications are usually major (e.g., restructuring the suggested code, adding missing error handling) rather than minor tweaks.
- Five adoption patterns emerged:
  1. Direct adoption – Copy‑paste the suggestion.
  2. Prompt‑driven refinement – Developers ask follow‑up questions, receive a revised suggestion, then adopt it.
  3. Partial integration – Only a subset of the suggestion fits the codebase.
  4. Error correction – Developers fix syntactic or logical mistakes in the LLM output before committing.
  5. Rejection & rewrite – The suggestion is discarded and a completely different refactoring is performed.
- Prompt quality matters – clearer, more constrained prompts lead to higher rates of direct adoption.
- Model validity is critical – when ChatGPT’s answer contains errors, developers tend to fall into the “error correction” or “rejection & rewrite” patterns.
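One common way to check whether answer validity and adoption pattern are associated is a Pearson chi-square test on a contingency table. The sketch below is only an illustration of that idea: this summary does not say which statistical test the authors used, the helper name `chi_square` is hypothetical, and the counts in the usage note are invented, not taken from the paper.

```python
def chi_square(table: list[list[int]]) -> tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a contingency table of observed counts, e.g. rows for answer
    validity and columns for adoption pattern. Assumes every row and column
    total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

With the invented table `[[30, 10], [5, 15]]` (rows: correct vs. erroneous answer; columns: direct adoption vs. any other pattern), the statistic is about 13.71 with 1 degree of freedom, above the 3.84 critical value at the 5 % level. Real data with small expected counts would call for an exact test instead.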
Practical Implications
- IDE plugins can surface confidence scores or validation checks (e.g., static analysis) alongside LLM suggestions to reduce the “error correction” workload.
- Prompt‑engineering tools (templates, autocomplete for prompts) could increase the proportion of direct adoptions, saving developer time.
- CI/CD integration – automated tests can be run on LLM‑generated patches before they are merged, catching the few cases where the model hallucinates.
- Team workflows – recording the prompt and the adoption pattern behind each AI‑assisted change can help teams standardize how they review such changes, ensuring consistency and traceability.
- Potential productivity boost – given that most commits in the dataset applied suggestions unchanged, LLM‑driven refactoring may be a low‑friction way to improve code readability and maintainability at scale.
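The validation checks mentioned above could start with something as small as the following Python sketch, which a plugin or CI job might run before a suggestion is shown or merged. It is an assumption-laden example rather than anything proposed in the paper: the function names `public_signatures` and `check_refactoring` are hypothetical, it handles Python source only, and it checks just two things (the suggestion parses, and top-level public functions keep their positional parameters).

```python
import ast


def public_signatures(source: str) -> dict[str, list[str]]:
    """Map each top-level public function name to its positional parameter names."""
    tree = ast.parse(source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
    }


def check_refactoring(original: str, suggested: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed.

    Assumes `original` is valid Python; only `suggested` is treated as untrusted.
    """
    try:
        new_sigs = public_signatures(suggested)
    except SyntaxError as exc:
        return [f"suggestion does not parse: {exc.msg}"]

    problems = []
    for name, params in public_signatures(original).items():
        if name not in new_sigs:
            problems.append(f"public function removed: {name}")
        elif new_sigs[name] != params:
            problems.append(f"signature changed: {name}")
    return problems
```

A check like this cannot show that behaviour is preserved, so it would complement the project's test suite, not replace it. Deliberate renames would also be flagged and need a human decision.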
Limitations & Future Work
- Sample bias – the dataset only includes commits that explicitly link to a ChatGPT conversation, possibly over‑representing developers who are already enthusiastic about AI assistance.
- Language & domain scope – the study focuses on a handful of popular languages (e.g., Python, JavaScript); results may differ for systems languages or domain‑specific code.
- Temporal snapshot – the analysis reflects a specific version of ChatGPT; rapid model updates could shift adoption patterns.
- Future directions suggested by the authors include expanding the dataset to other LLMs, exploring automated detection of low‑quality suggestions, and conducting controlled user studies to measure productivity gains more precisely.
Authors
- David Schön
- Faiza Amjad
- Tehreem Asif
- Ranim Khojah
- Mazen Mohamad
- Francisco Gomes de Oliveira Neto
- Philipp Leitner
Paper Information
- arXiv ID: 2605.04835v1
- Categories: cs.SE, cs.HC
- Published: May 6, 2026