[Paper] Patterns of Developer Adoption of LLM-Generated Code Refactoring Suggestions

Published: May 6, 2026 at 08:31 AM EDT

Source: arXiv - 2605.04835v1

Overview

The paper investigates how software engineers actually use refactoring suggestions generated by large language models (LLMs) such as ChatGPT. By mining 169 real‑world GitHub commits that reference a ChatGPT conversation, the authors reveal the patterns that emerge when developers accept, modify, or reject AI‑driven refactorings. The study bridges the gap between academic evaluations of LLM output and the day‑to‑day workflow of developers.

Key Contributions

Empirical dataset of 169 GitHub commits linking ChatGPT‑generated refactoring advice to concrete code changes.
Adoption taxonomy that classifies developer responses into five distinct patterns (e.g., direct adoption, prompt‑driven refinement, rejection and rewrite).
Analysis of how the nature of the original prompt and the validity of the LLM’s answer relate to the type of modification developers perform.
Evidence that most suggestions are used as‑is, highlighting a high trust level in current LLMs for refactoring tasks.
Guidelines for tool builders on how to surface, validate, and integrate LLM refactoring suggestions into IDEs and CI pipelines.

Methodology

Data collection – The authors searched public GitHub repositories for commit messages that contain a URL to a ChatGPT conversation. Each match was manually verified to ensure the commit indeed applied a refactoring suggested by the model.
Commit analysis – For every commit, the original code, the LLM’s suggestion, and the final code after the commit were compared.
Pattern identification – Using qualitative coding, the researchers grouped the observed developer actions into five high‑level patterns, taking into account:
- The refactoring activity (e.g., rename, extract method, simplify condition).
- The prompt the developer gave to ChatGPT (clarity, specificity).
- The validity of the model’s answer (correct, partially correct, or erroneous).
Quantitative summary – Frequencies of each pattern were computed, and statistical checks were performed to see whether prompt quality or answer correctness significantly influenced the chosen pattern.
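The data‑collection step above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the summary does not state the exact search pattern they used, so the two URL shapes in the regular expression are an assumption about what shared ChatGPT conversation links look like, and the function names `find_chatgpt_links` and `candidate_commits` are hypothetical.

```python
import re

# Assumption: shared ChatGPT conversations use one of these two URL shapes,
# followed by a 36-character UUID. The authors' real search query may differ.
CHATGPT_SHARE_URL = re.compile(
    r"https://(?:chat\.openai\.com|chatgpt\.com)/share/[0-9a-fA-F-]{36}"
)


def find_chatgpt_links(commit_message: str) -> list:
    """Return every ChatGPT share link found in a commit message."""
    return CHATGPT_SHARE_URL.findall(commit_message)


def candidate_commits(commits: dict) -> dict:
    """Map commit SHA -> links, keeping only commits that reference a conversation.

    `commits` maps a commit SHA to its commit message (hypothetical input shape).
    """
    hits = {}
    for sha, message in commits.items():
        links = find_chatgpt_links(message)
        if links:
            hits[sha] = links
    return hits
```

A filter like this only produces candidates. Per the methodology, each match still needs manual verification that the commit actually applied a refactoring suggested in the linked conversation.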

Results & Findings

Straight acceptance dominates – ≈ 68 % of commits applied the LLM’s suggestion without any changes.
When developers intervene, the modifications are usually major (e.g., restructuring the suggested code, adding missing error handling) rather than minor tweaks.
Five adoption patterns emerged:

1. Direct adoption

Developers copy‑paste the suggestion unchanged.

2. Prompt‑driven refinement

Developers ask follow‑up questions, receive a revised suggestion, then adopt it.

3. Partial integration

Developers apply only the subset of the suggestion that fits the codebase.

4. Error correction

Developers fix syntactic or logical mistakes in the LLM output before committing.

5. Rejection & rewrite

The suggestion is discarded and a completely different refactoring is performed.

Prompt quality matters – clearer, more constrained prompts lead to higher rates of direct adoption.
Model validity is critical – when ChatGPT’s answer contains errors, developers tend to fall into the “error correction” or “rejection & rewrite” patterns.
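To make the taxonomy concrete, the sketch below encodes the five patterns as a Python enum and adds a rough triage helper. It is not the authors' procedure: they assigned patterns through manual qualitative coding. The helper only pre‑sorts the two extremes by token similarity between the suggested and the committed code, and the thresholds `direct` and `rewrite` are hypothetical values chosen for illustration. Everything in between is left for a human to label.

```python
from collections import Counter
from difflib import SequenceMatcher
from enum import Enum


class AdoptionPattern(Enum):
    DIRECT_ADOPTION = 1
    PROMPT_DRIVEN_REFINEMENT = 2
    PARTIAL_INTEGRATION = 3
    ERROR_CORRECTION = 4
    REJECTION_AND_REWRITE = 5


def similarity(suggested: str, committed: str) -> float:
    """Token-level similarity in [0, 1] between suggested and committed code."""
    return SequenceMatcher(None, suggested.split(), committed.split()).ratio()


def triage(suggested: str, committed: str, direct: float = 0.95, rewrite: float = 0.30):
    """Pre-sort a commit; return None when a human has to decide.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    score = similarity(suggested, committed)
    if score >= direct:
        return AdoptionPattern.DIRECT_ADOPTION
    if score <= rewrite:
        return AdoptionPattern.REJECTION_AND_REWRITE
    return None  # refinement, partial integration, or error correction


def pattern_shares(labels: list) -> dict:
    """Share of each pattern among already-labelled commits."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}
```

Similarity alone cannot separate the three middle patterns, because telling a follow‑up prompt from a manual bug fix requires reading the conversation and the diff. That is why the helper returns `None` there instead of guessing.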

Practical Implications

IDE plugins can surface confidence scores or validation checks (e.g., static analysis) alongside LLM suggestions to reduce the “error correction” workload.
Prompt‑engineering tools (templates, autocomplete for prompts) could increase the proportion of direct adoptions, saving developer time.
CI/CD integration – automated tests can be run on LLM‑generated patches before they are merged, catching the few cases where the model hallucinates.
Team workflows – recording the prompt and the resulting adoption pattern alongside each AI‑generated change can help teams standardize how they review such changes, ensuring consistency and traceability.
Productivity boost – given that roughly two thirds of the studied commits applied suggestions unchanged, LLM‑driven refactoring can be a low‑friction way to improve code readability and maintainability at scale.
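As one possible form of the validation checks mentioned above, a tool could run a cheap syntax check on a suggestion before showing it to the developer. The sketch below is a minimal example under stated assumptions: it handles Python suggestions only, uses the standard‑library `ast` module, and the function name `syntax_problem` is hypothetical. It catches syntactic mistakes only. Logical errors still need tests or deeper static analysis.

```python
import ast
from typing import Optional


def syntax_problem(suggested_code: str) -> Optional[str]:
    """Return a short description of the first syntax error, or None if the code parses.

    Assumption: the suggestion is Python source; other languages need their own parser.
    """
    try:
        ast.parse(suggested_code)
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


if __name__ == "__main__":
    good = "def total(xs):\n    return sum(xs)\n"
    bad = "def total(xs:\n    return sum(xs)\n"
    for label, code in (("good", good), ("bad", bad)):
        problem = syntax_problem(code)
        print(label, "->", "ok" if problem is None else problem)
```

An IDE plugin or CI job could use the result to flag a suggestion rather than hide it, so the developer still decides whether to repair or discard it.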

Limitations & Future Work

Sample bias – the dataset only includes commits that explicitly link to a ChatGPT conversation, possibly over‑representing developers who are already enthusiastic about AI assistance.
Language & domain scope – the study focuses on a handful of popular languages (e.g., Python, JavaScript); results may differ for systems languages or domain‑specific code.
Temporal snapshot – the analysis reflects a specific version of ChatGPT; rapid model updates could shift adoption patterns.
Future directions suggested by the authors include expanding the dataset to other LLMs, exploring automated detection of low‑quality suggestions, and conducting controlled user studies to measure productivity gains more precisely.

Authors

David Schön
Faiza Amjad
Tehreem Asif
Ranim Khojah
Mazen Mohamad
Francisco Gomes de Oliveira Neto
Philipp Leitner

Paper Information

arXiv ID: 2605.04835v1
Categories: cs.SE, cs.HC
Published: May 6, 2026
