[Paper] From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models
Source: arXiv - 2602.21833v1
Overview
The paper investigates how large language models (LLMs) can be used to automatically refactor Java code for better readability. By running a large‑scale, multi‑iteration experiment with GPT‑5.1 on 230 code snippets, the authors show how LLM‑driven refactoring evolves from aggressive early restructuring to a stable, “optimally readable” form.
Key Contributions
- Large‑scale empirical study of iterative readability‑focused refactoring using a state‑of‑the‑art LLM (GPT‑5.1).
- Three‑category taxonomy of code changes (implementation‑level, syntactic, comment‑level) that captures fine‑grained transformations during refactoring.
- Discovery of a convergence pattern: early iterations aggressively restructure code, later iterations stabilize toward a consistent, readable version.
- Evidence that convergence is robust across different code variants and only mildly affected by prompting style.
- Open dataset & experimental pipeline that other researchers and tool builders can reuse for comparative studies.
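The implementation / syntactic / comment‑level taxonomy can be approximated with a simple diff‑based heuristic. This is an illustrative sketch, not the authors' actual classifier; the comparison rules (whitespace‑insensitive equality for syntactic edits, stripping `//` line comments for comment‑level edits) are assumptions:

```python
def classify_edit(before: str, after: str) -> str:
    """Roughly label a Java edit as syntactic, comment-level, or implementation.

    Illustrative heuristic only -- not the paper's classifier.
    """
    if before == after:
        return "none"
    # Syntactic: only whitespace/formatting differs.
    if "".join(before.split()) == "".join(after.split()):
        return "syntactic"

    def strip_comments(src: str) -> str:
        # Drop '//' line comments (ignores block comments for brevity).
        return "\n".join(line.split("//")[0].rstrip() for line in src.splitlines())

    # Comment-level: removing comments makes both versions identical.
    if strip_comments(before) == strip_comments(after):
        return "comment-level"
    # Everything else touches the logic itself.
    return "implementation"
```

For example, `classify_edit("int x=1;", "int x = 1;")` is labeled syntactic, while renaming `x` to `count` would be labeled implementation.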
Methodology
- Snippet selection & variation – 230 Java snippets were collected from open‑source projects. Each snippet was automatically mutated to create multiple “variants” (e.g., renamed variables, reordered statements) to test robustness.
- Prompting strategies – three prompt designs were compared:
  - Baseline: “Refactor this code for readability.”
  - Factor‑specific: “Improve naming and comments.”
  - Iterative: “Apply the previous refactoring again.”
- Iterative refactoring loop – For each snippet‑variant, the LLM was invoked five times in a row, feeding the previous output back as input.
- Change classification – the authors diffed successive versions and labeled edits as:
  - Implementation (logic restructuring, extraction of methods)
  - Syntactic (formatting, whitespace, ordering)
  - Comment‑level (adding/modifying documentation)
- Correctness check – Unit tests accompanying the snippets were run after each iteration to ensure functional behavior was preserved.
- Generalization test – a held‑out set of “novel” snippets was processed to check whether the observed patterns held beyond the original snippet pool.
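The iterative refactoring loop above can be sketched as follows. Here `refactor` stands in for the GPT‑5.1 call and `passes_tests` for the snippet's unit tests; both are hypothetical placeholders, and the early‑exit rules are assumptions rather than details from the paper:

```python
from typing import Callable

def iterative_refactor(
    code: str,
    refactor: Callable[[str], str],       # hypothetical stand-in for the LLM call
    passes_tests: Callable[[str], bool],  # hypothetical stand-in for the unit tests
    max_iterations: int = 5,
) -> list[str]:
    """Feed each output back as the next input, as in the paper's loop.

    Keeps only versions that pass the tests and stops early once the
    model reproduces its own output. Returns the version history.
    """
    history = [code]
    for _ in range(max_iterations):
        candidate = refactor(history[-1])
        if not passes_tests(candidate):
            break  # correctness check failed: keep the last good version
        if candidate == history[-1]:
            break  # converged: the model made no further changes
        history.append(candidate)
    return history
```

With a toy `refactor` that strips surrounding whitespace, the loop converges after a single pass, mirroring the stabilization behavior the paper measures over five iterations.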
Results & Findings
| Observation | What the data showed |
|---|---|
| Restructuring → Stabilization | The first 2‑3 iterations produced large, often disruptive edits (e.g., method extraction, variable renaming). By iteration 4‑5, changes shrank to minor formatting or comment tweaks, indicating convergence. |
| Consistency across variants | Even when input code was heavily perturbed, the LLM still followed the same convergence curve, suggesting an internal “readability optimum.” |
| Prompt influence | Explicitly asking for specific readability factors nudged the LLM toward more comment‑level edits, but did not dramatically alter the overall convergence timeline. |
| Functional safety | Over 96% of the refactored snippets passed their original unit tests after each iteration, showing that readability improvements rarely broke functionality. |
| Robustness on novel code | The same convergence pattern emerged on unseen snippets, reinforcing that the behavior is model‑wide rather than dataset‑specific. |
Practical Implications
- Tooling confidence – Developers can integrate LLM‑based refactoring assistants into CI pipelines, knowing that after a few passes the model will settle into a stable, readable version without repeatedly breaking the code.
- Iterative workflow – Instead of a single “run‑once” refactor, an iterative approach (e.g., 3‑4 passes) yields better, more predictable outcomes.
- Prompt design guidance – Simple “make it more readable” prompts work well; adding explicit factors only fine‑tunes the result, so teams can keep prompts lightweight.
- Quality gates – Since functional correctness is largely preserved, teams can add a “readability convergence” gate that stops after the model’s changes fall below a size threshold.
- Cross‑model benchmarking – The released dataset enables developers to compare different LLMs (Claude, Gemini, etc.) on the same readability task, helping choose the most cost‑effective option for their stack.
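A “readability convergence” gate of the kind suggested above could be implemented as a diff‑size check between successive versions. This is a minimal sketch; the 5% line‑change threshold is an illustrative assumption, not a value from the paper:

```python
import difflib

def change_ratio(previous: str, current: str) -> float:
    """Fraction of line-level difference between two versions (0.0 = identical)."""
    matcher = difflib.SequenceMatcher(
        None, previous.splitlines(), current.splitlines()
    )
    return 1.0 - matcher.ratio()

def has_converged(previous: str, current: str, threshold: float = 0.05) -> bool:
    """Gate check: stop iterating once the model's edits fall below the threshold."""
    return change_ratio(previous, current) <= threshold
```

A CI pipeline would call `has_converged` after each refactoring pass and stop requesting further passes once it returns `True`, bounding cost while still capturing the early, high‑impact restructuring iterations.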
Limitations & Future Work
- Model specificity – Experiments were limited to GPT‑5.1; other architectures may exhibit different convergence speeds or quality.
- Language scope – Only Java snippets were examined; results may not transfer directly to dynamically typed languages or those with different idioms (e.g., Python, JavaScript).
- Readability metrics – The study relied on human‑interpreted readability improvements; future work could integrate objective metrics (e.g., cyclomatic complexity, Halstead) to quantify gains.
- Long‑term maintenance – The impact of LLM‑refactored code on future manual edits or on downstream tools (static analysis, code review bots) remains an open question.
Bottom line: The research shows that LLMs can reliably “clean up” code through a few iterative passes, offering a practical, low‑risk way for development teams to boost readability without sacrificing correctness.
Authors
- Norman Peitek
- Julia Hess
- Sven Apel
Paper Information
- arXiv ID: 2602.21833v1
- Categories: cs.SE
- Published: February 25, 2026