[Paper] From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models
Source: arXiv - 2602.21833v1
Overview
The paper investigates how large language models (LLMs) can be used to automatically refactor Java code for better readability. By running a large‑scale, multi‑iteration experiment with GPT‑5.1 on 230 code snippets, the authors show how LLM‑driven refactoring evolves from aggressive early restructuring to a stable, “optimally readable” form.
Key Contributions
- Large‑scale empirical study of iterative readability‑focused refactoring using a state‑of‑the‑art LLM (GPT‑5.1).
- Three‑category taxonomy of code changes (implementation‑level, syntactic, comment‑level) that captures fine‑grained transformations during refactoring.
- Discovery of a convergence pattern: early iterations aggressively restructure code, later iterations stabilize toward a consistent, readable version.
- Evidence that convergence is robust across different code variants and only mildly affected by prompting style.
- Open dataset & experimental pipeline that other researchers and tool builders can reuse for comparative studies.
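The implementation / syntactic / comment‑level taxonomy can be approximated with a simple diff‑based heuristic. This is an illustrative sketch, not the authors' actual classifier; the comparison rules (whitespace‑insensitive equality for syntactic edits, stripping `//` line comments for comment‑level edits) are assumptions:

```python
def classify_edit(before: str, after: str) -> str:
    """Roughly label a Java edit as syntactic, comment-level, or implementation.

    Illustrative heuristic only -- not the paper's classifier.
    """
    if before == after:
        return "none"
    # Syntactic: only whitespace/formatting differs.
    if "".join(before.split()) == "".join(after.split()):
        return "syntactic"

    def strip_comments(src: str) -> str:
        # Drop '//' line comments (ignores block comments for brevity).
        return "\n".join(line.split("//")[0].rstrip() for line in src.splitlines())

    # Comment-level: removing comments makes both versions identical.
    if strip_comments(before) == strip_comments(after):
        return "comment-level"
    # Everything else touches the logic itself.
    return "implementation"
```

For example, `classify_edit("int x=1;", "int x = 1;")` is labeled syntactic, while renaming `x` to `count` would be labeled implementation.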
Methodology
- Snippet selection & variation – 230 Java snippets were collected from open‑source projects. Each snippet was automatically mutated to create multiple “variants” (e.g., renamed variables, reordered statements) to test robustness.
- Prompting strategies – three prompt designs were compared:
  - Baseline: “Refactor this code for readability.”
  - Factor‑specific: “Improve naming and comments.”
  - Iterative: “Apply the previous refactoring again.”
- Iterative refactoring loop – For each snippet‑variant, the LLM was invoked five times in a row, feeding the previous output back as input.
- Change classification – the authors diffed successive versions and labeled edits as:
  - Implementation (logic restructuring, extraction of methods)
  - Syntactic (formatting, whitespace, ordering)
  - Comment‑level (adding/modifying documentation)
- Correctness check – Unit tests accompanying the snippets were run after each iteration to ensure functional behavior was preserved.
- Generalization test – a held‑out set of “novel” snippets was processed to check whether the observed patterns held beyond the original snippet pool.
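The iterative refactoring loop above can be sketched as follows. Here `refactor` stands in for the GPT‑5.1 call and `passes_tests` for the snippet's unit tests; both are hypothetical placeholders, and the early‑exit rules are assumptions rather than details from the paper:

```python
from typing import Callable

def iterative_refactor(
    code: str,
    refactor: Callable[[str], str],       # hypothetical stand-in for the LLM call
    passes_tests: Callable[[str], bool],  # hypothetical stand-in for the unit tests
    max_iterations: int = 5,
) -> list[str]:
    """Feed each output back as the next input, as in the paper's loop.

    Keeps only versions that pass the tests and stops early once the
    model reproduces its own output. Returns the version history.
    """
    history = [code]
    for _ in range(max_iterations):
        candidate = refactor(history[-1])
        if not passes_tests(candidate):
            break  # correctness check failed: keep the last good version
        if candidate == history[-1]:
            break  # converged: the model made no further changes
        history.append(candidate)
    return history
```

With a toy `refactor` that strips surrounding whitespace, the loop converges after a single pass, mirroring the stabilization behavior the paper measures over five iterations.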
Results & Findings
| Observation | What the data showed |
|---|---|
| Restructuring → Stabilization | The first 2‑3 iterations produced large, often disruptive edits (e.g., method extraction, variable renaming). By iteration 4‑5, changes shrank to minor formatting or comment tweaks, indicating convergence. |
| Consistency across variants | Even when input code was heavily perturbed, the LLM still followed the same convergence curve, suggesting an internal “readability optimum.” |
| Prompt influence | Explicitly asking for specific readability factors nudged the LLM toward more comment‑level edits, but did not dramatically alter the overall convergence timeline. |
| Functional safety | Over 96% of the refactored snippets passed their original unit tests after each iteration, showing that readability improvements rarely broke functionality. |
| Robustness on novel code | The same convergence pattern emerged on unseen snippets, reinforcing that the behavior is model‑wide rather than dataset‑specific. |
Practical Implications
- Tooling confidence – Developers can integrate LLM‑based refactoring assistants into CI pipelines, knowing that after a few passes the model will settle into a stable, readable version without repeatedly breaking the code.
- Iterative workflow – Instead of a single “run‑once” refactor, an iterative approach (e.g., 3‑4 passes) yields better, more predictable outcomes.
- Prompt design guidance – Simple “make it more readable” prompts work well; adding explicit factors only fine‑tunes the result, so teams can keep prompts lightweight.
- Quality gates – Since functional correctness is largely preserved, teams can add a “readability convergence” gate that stops after the model’s changes fall below a size threshold.
- Cross‑model benchmarking – The released dataset enables developers to compare different LLMs (Claude, Gemini, etc.) on the same readability task, helping choose the most cost‑effective option for their stack.
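A “readability convergence” gate of the kind suggested above could be implemented as a diff‑size check between successive versions. This is a minimal sketch; the 5% line‑change threshold is an illustrative assumption, not a value from the paper:

```python
import difflib

def change_ratio(previous: str, current: str) -> float:
    """Fraction of line-level difference between two versions (0.0 = identical)."""
    matcher = difflib.SequenceMatcher(
        None, previous.splitlines(), current.splitlines()
    )
    return 1.0 - matcher.ratio()

def has_converged(previous: str, current: str, threshold: float = 0.05) -> bool:
    """Gate check: stop iterating once the model's edits fall below the threshold."""
    return change_ratio(previous, current) <= threshold
```

A CI pipeline would call `has_converged` after each refactoring pass and stop requesting further passes once it returns `True`, bounding cost while still capturing the early, high‑impact restructuring iterations.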
Limitations & Future Work
- Model specificity – Experiments were limited to GPT‑5.1; other architectures may exhibit different convergence speeds or quality.
- Language scope – Only Java snippets were examined; results may not transfer directly to dynamically typed languages or those with different idioms (e.g., Python, JavaScript).
- Readability metrics – The study relied on human‑interpreted readability improvements; future work could integrate objective metrics (e.g., cyclomatic complexity, Halstead) to quantify gains.
- Long‑term maintenance – The impact of LLM‑refactored code on future manual edits or on downstream tools (static analysis, code review bots) remains an open question.
Bottom line: The research shows that LLMs can reliably “clean up” code through a few iterative passes, offering a practical, low‑risk way for development teams to boost readability without sacrificing correctness.
Authors
- Norman Peitek
- Julia Hess
- Sven Apel
Paper Information
- arXiv ID: 2602.21833v1
- Categories: cs.SE
- Published: February 25, 2026