[Paper] From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models

Published: February 25, 2026 at 07:05 AM EST

Source: arXiv - 2602.21833v1

Overview

The paper investigates how large language models (LLMs) can be used to automatically refactor Java code for better readability. By running a massive, multi‑iteration experiment with GPT‑5.1 on 230 code snippets, the authors uncover how LLM‑driven refactoring evolves from chaotic restructuring to a stable, “optimally readable” form.

Key Contributions

  • Large‑scale empirical study of iterative readability‑focused refactoring using a state‑of‑the‑art LLM (GPT‑5.1).
  • Three‑phase taxonomy of code changes (implementation, syntactic, comment‑level) that captures fine‑grained transformations during refactoring.
  • Discovery of a convergence pattern: early iterations aggressively restructure code, later iterations stabilize toward a consistent, readable version.
  • Evidence that convergence is robust across different code variants and only mildly affected by prompting style.
  • Open dataset & experimental pipeline that other researchers and tool builders can reuse for comparative studies.

Methodology

  1. Snippet selection & variation – 230 Java snippets were collected from open‑source projects. Each snippet was automatically mutated to create multiple “variants” (e.g., renamed variables, reordered statements) to test robustness.
  2. Prompting strategies – Three prompt designs were compared:
    • Baseline: “Refactor this code for readability.”
    • Factor‑specific: “Improve naming and comments.”
    • Iterative: “Apply the previous refactoring again.”
  3. Iterative refactoring loop – For each snippet‑variant, the LLM was invoked five times in a row, feeding the previous output back as input.
  4. Change classification – The authors diffed successive versions and labeled edits as:
    • Implementation (logic restructuring, extraction of methods)
    • Syntactic (formatting, whitespace, ordering)
    • Comment‑level (adding/modifying documentation).
  5. Correctness check – Unit tests accompanying the snippets were run after each iteration to ensure functional behavior was preserved.
  6. Generalization test – A held‑out set of “novel” snippets was processed to see whether the observed patterns held beyond the training data.
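Steps 3–5 above can be sketched as a short loop. The `refactor_fn` callable stands in for the GPT‑5.1 invocation (any chat‑completion client could be plugged in), and `classify_edit` is a crude keyword heuristic, not the paper's actual three‑phase classifier:

```python
import difflib
from typing import Callable, Iterator, List, Tuple

def classify_edit(diff_line: str) -> str:
    """Crude stand-in for the paper's taxonomy: comment / syntactic / implementation."""
    text = diff_line.lstrip("+- ").strip()
    if text.startswith(("//", "/*", "*")):
        return "comment"
    if text in ("", "{", "}"):
        return "syntactic"
    return "implementation"

def iterate(code: str, prompt: str,
            refactor_fn: Callable[[str, str], str],
            passes: int = 5) -> Iterator[Tuple[str, List[str]]]:
    """Feed each output back as input for `passes` rounds, labeling each diff."""
    current = code
    for _ in range(passes):
        new_code = refactor_fn(current, prompt)
        diff = [l for l in difflib.unified_diff(
                    current.splitlines(), new_code.splitlines(), lineterm="")
                if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
        yield new_code, [classify_edit(l) for l in diff]
        current = new_code
```

In the study, each iteration was additionally followed by running the snippet's unit tests (step 5); that correctness check is omitted from this sketch.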

Results & Findings

  • Restructuring → stabilization – The first 2‑3 iterations produced large, often disruptive edits (e.g., method extraction, variable renaming). By iterations 4‑5, changes shrank to minor formatting or comment tweaks, indicating convergence.
  • Consistency across variants – Even when the input code was heavily perturbed, the LLM followed the same convergence curve, suggesting an internal “readability optimum.”
  • Prompt influence – Explicitly asking for specific readability factors nudged the LLM toward more comment‑level edits, but did not dramatically alter the overall convergence timeline.
  • Functional safety – Over 96% of the refactored snippets passed their original unit tests after each iteration, showing that readability improvements rarely broke functionality.
  • Robustness on novel code – The same convergence pattern emerged on unseen snippets, reinforcing that the behavior is model‑wide rather than dataset‑specific.

Practical Implications

  • Tooling confidence – Developers can integrate LLM‑based refactoring assistants into CI pipelines, knowing that after a few passes the model will settle into a stable, readable version without repeatedly breaking the code.
  • Iterative workflow – Instead of a single “run‑once” refactor, an iterative approach (e.g., 3‑4 passes) yields better, more predictable outcomes.
  • Prompt design guidance – Simple “make it more readable” prompts work well; adding explicit factors only fine‑tunes the result, so teams can keep prompts lightweight.
  • Quality gates – Since functional correctness is largely preserved, teams can add a “readability convergence” gate that stops after the model’s changes fall below a size threshold.
  • Cross‑model benchmarking – The released dataset enables developers to compare different LLMs (Claude, Gemini, etc.) on the same readability task, helping choose the most cost‑effective option for their stack.
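The “readability convergence” gate described above amounts to a plain diff‑size check between successive versions. A minimal sketch, where the three‑line threshold is an arbitrary illustration rather than a value from the paper:

```python
import difflib

def changed_lines(old: str, new: str) -> int:
    """Count added and removed lines between two versions of a file."""
    return sum(1 for l in difflib.unified_diff(
                   old.splitlines(), new.splitlines(), lineterm="")
               if l.startswith(("+", "-")) and not l.startswith(("+++", "---")))

def converged(old: str, new: str, threshold: int = 3) -> bool:
    """Gate: stop iterating once the model's edit falls below `threshold` lines."""
    return changed_lines(old, new) < threshold
```

A CI pipeline would call `converged` after each LLM pass and stop requesting further refactorings once it returns true.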

Limitations & Future Work

  • Model specificity – Experiments were limited to GPT‑5.1; other architectures may exhibit different convergence speeds or quality.
  • Language scope – Only Java snippets were examined; results may not transfer directly to dynamically typed languages or those with different idioms (e.g., Python, JavaScript).
  • Readability metrics – The study relied on human‑interpreted readability improvements; future work could integrate objective metrics (e.g., cyclomatic complexity, Halstead metrics) to quantify gains.
  • Long‑term maintenance – The impact of LLM‑refactored code on future manual edits or on downstream tools (static analysis, code review bots) remains an open question.
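As a taste of the objective metrics mentioned under future work, McCabe cyclomatic complexity can be approximated for a Java snippet by counting decision points. This is a keyword‑matching heuristic for illustration only; a real tool would parse the AST:

```python
import re

# Decision points that each add one independent path through the code.
# Heuristic: pattern-matches keywords and boolean operators, not a parser.
DECISION = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")

def cyclomatic(java_source: str) -> int:
    """McCabe complexity ~= 1 + number of decision points."""
    return 1 + len(DECISION.findall(java_source))
```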

Bottom line: The research shows that LLMs can reliably “clean up” code through a few iterative passes, offering a practical, low‑risk way for development teams to boost readability without sacrificing correctness.

Authors

  • Norman Peitek
  • Julia Hess
  • Sven Apel

Paper Information

  • arXiv ID: 2602.21833v1
  • Categories: cs.SE
  • Published: February 25, 2026
  • PDF: Download PDF