[Paper] CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
Source: arXiv - 2603.04177v1
Overview
The paper CodeTaste asks a simple yet powerful question: Can today’s large language model (LLM) coding assistants perform refactorings at the level of a human developer? By mining real‑world multi‑file refactoring commits from open‑source projects, the authors build a benchmark that lets us measure how reliably LLMs can clean up code, remove duplication, and improve architecture without breaking functionality.
Key Contributions
- CodeTaste benchmark – a curated dataset of 1,200+ real refactoring tasks extracted from popular GitHub repositories, each paired with the original test suite and a set of static “pattern” checks.
- Hybrid evaluation metric – combines passing the repository’s unit tests with automated static analyses that verify the removal of unwanted code patterns and the introduction of desired ones, using data‑flow reasoning.
- Empirical study of frontier LLMs – evaluates GPT‑4, Claude‑2, and Llama‑2‑70B on two modes: (1) detailed refactoring instructions, and (2) open‑ended improvement prompts.
- Propose‑then‑implement workflow – shows that letting the model first suggest a refactoring plan, then selecting the best‑aligned proposal before implementation, significantly narrows the performance gap.
- Insight into human‑LLM alignment – reveals that current models excel when the refactoring goal is explicitly spelled out, but struggle to infer the “right” human‑chosen transformation from vague improvement hints.
Methodology
- Data collection – The authors mined the GitHub History API for commits that (a) touched multiple files, (b) passed all tests before and after the change, and (c) were labeled with refactoring‑related keywords (e.g., “extract method”, “rename variable”).
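The filtering step above can be sketched as a simple predicate over mined commits. This is an illustrative sketch, not the authors' implementation: the keyword list and the commit dictionary shape are assumptions, and the real pipeline also verifies that the test suite passes before and after the change.

```python
# Illustrative commit filter (assumed structure, not the paper's exact code).
REFACTORING_KEYWORDS = ("extract method", "rename variable", "inline", "move class")

def is_refactoring_candidate(commit: dict) -> bool:
    """Keep commits that touch multiple files and mention a refactoring keyword."""
    message = commit["message"].lower()
    multi_file = len(commit["files"]) > 1
    keyword_hit = any(kw in message for kw in REFACTORING_KEYWORDS)
    return multi_file and keyword_hit

commit = {"message": "Refactor: extract method for parsing",
          "files": ["Parser.java", "ParserHelper.java"]}
print(is_refactoring_candidate(commit))  # True
```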
- Task formulation – Each commit becomes a task: the pre‑refactor code is the prompt, and the post‑refactor code is the ground truth. Two prompt styles are used:
  - Detailed – “Extract the duplicated logic in `foo()` into a helper method called `processBar`.”
  - Open‑ended – “Improve the design of `foo()`.”
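A task pair of this kind looks roughly like the following. This is a made-up Python example for illustration (the benchmark itself targets Java): the pre-refactor code duplicates a validation check, and the ground-truth post-refactor code extracts it into a helper.

```python
# Pre-refactor (the prompt): duplicated validation logic in two functions.
def create_user(name):
    if not name or not name.strip():
        raise ValueError("invalid name")
    return {"name": name.strip()}

def rename_user(user, name):
    if not name or not name.strip():
        raise ValueError("invalid name")
    user["name"] = name.strip()
    return user

# Post-refactor (the ground truth): the duplicated check extracted into a helper.
def _validate_name(name):
    if not name or not name.strip():
        raise ValueError("invalid name")
    return name.strip()

def create_user_v2(name):
    return {"name": _validate_name(name)}

def rename_user_v2(user, name):
    user["name"] = _validate_name(name)
    return user
```

The test suite must pass against both versions; the static checks additionally confirm that the duplication is gone.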
- LLM interaction – Models are queried via the OpenAI/Anthropic APIs with temperature = 0.2. For the propose‑then‑implement pipeline, the model first generates up to three refactoring proposals, a lightweight static matcher scores each proposal, and the highest‑scoring one is fed back for concrete code generation.
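The propose-then-implement loop can be sketched as below. The scoring function here is a hypothetical stand-in for the paper's "lightweight static matcher": it simply counts mentions of desired pattern keywords, whereas the real matcher uses static analysis.

```python
# Hedged sketch of proposal selection; score_proposal is a placeholder
# for the paper's lightweight static matcher.
def score_proposal(proposal: str, target_patterns: set) -> int:
    """Count how many desired pattern keywords the proposal mentions."""
    return sum(1 for p in target_patterns if p in proposal.lower())

def select_best(proposals: list, target_patterns: set) -> str:
    """Pick the proposal that best aligns with the desired patterns."""
    return max(proposals, key=lambda p: score_proposal(p, target_patterns))

proposals = [
    "Rename variable tmp to buffer",
    "Extract the duplicated branch into a helper method",
    "Add logging statements",
]
best = select_best(proposals, {"extract", "helper", "duplicated"})
print(best)  # the extract-method proposal
```

The selected proposal is then fed back to the model as an explicit instruction, effectively converting an open-ended task into a detailed one.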
- Evaluation – A solution passes if (i) the repository’s test suite still succeeds, and (ii) all static checks (e.g., “no duplicated `if` branches”, “no dead code”) are satisfied. The authors report pass rate and pattern precision (how many intended patterns were correctly introduced or removed).
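The two reported metrics can be expressed compactly. This is a minimal sketch under assumed data shapes (boolean test outcome, sets of pattern labels); the paper's exact precision definition may differ in detail.

```python
def task_passes(tests_pass: bool, static_checks: list) -> bool:
    """A solution passes iff the test suite succeeds AND every static check holds."""
    return tests_pass and all(static_checks)

def pattern_precision(introduced: set, intended: set) -> float:
    """Share of introduced/removed patterns that were actually intended."""
    if not introduced:
        return 0.0
    return len(introduced & intended) / len(introduced)

print(task_passes(True, [True, True]))                      # True
print(pattern_precision({"helper", "logging"}, {"helper"}))  # 0.5
```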
Results & Findings
| Model | Detailed Prompt Pass Rate | Open‑ended Prompt Pass Rate |
|---|---|---|
| GPT‑4 | 78 % | 42 % |
| Claude‑2 | 71 % | 38 % |
| Llama‑2‑70B | 55 % | 27 % |
- Detailed instructions → high success: When the refactoring goal is spelled out, even the largest open‑source model (Llama‑2‑70B) reaches >50 % pass rate.
- Open‑ended prompts → steep drop: All models struggle to infer the exact human‑chosen transformation, confirming a gap between “can refactor” and “can guess the right refactor”.
- Propose‑then‑implement gains: Adding a proposal selection step lifts GPT‑4’s open‑ended pass rate from 42 % to 58 %, and Claude‑2 from 38 % to 53 %.
- Pattern precision: When a model succeeds, it correctly introduces the desired patterns 92 % of the time, showing that failures are usually due to missed test coverage rather than malformed refactorings.
Practical Implications
- IDE assistants can safely automate well‑specified refactorings (e.g., “extract method”, “rename variable”) today, reducing boilerplate for developers.
- Code review bots could use a propose‑then‑implement loop to suggest multiple refactoring options, letting a human pick the most appropriate one—much like a “refactoring wizard” powered by LLMs.
- Continuous integration pipelines can incorporate CodeTaste‑style static checks to verify that AI‑generated patches don’t re‑introduce anti‑patterns, providing a safety net before merging.
- Enterprise tooling may start exposing a “refactor‑by‑prompt” API where developers give a high‑level goal (“reduce cyclomatic complexity in module X”) and the system returns a ranked list of concrete changes, accelerating legacy code modernization.
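A CodeTaste-style CI gate could be as simple as scanning the added lines of a patch for known anti-patterns. Everything here is an assumed illustration (the pattern set and diff format are placeholders), not tooling described in the paper.

```python
import re

# Hypothetical anti-pattern registry; real checks would use data-flow analysis.
ANTI_PATTERNS = {
    "bare except": re.compile(r"except\s*:"),
    "debug print": re.compile(r"\bprint\("),
}

def find_anti_patterns(patch: str) -> list:
    """Return names of anti-patterns found on the '+' (added) lines of a diff."""
    added = [line[1:] for line in patch.splitlines() if line.startswith("+")]
    return [name for name, rx in ANTI_PATTERNS.items()
            if any(rx.search(line) for line in added)]

patch = "+try:\n+    run()\n+except:\n+    pass\n"
print(find_anti_patterns(patch))  # ['bare except']
```

A pipeline would fail the merge when the returned list is non-empty, giving reviewers a safety net against AI-generated regressions.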
Limitations & Future Work
- Dataset bias – CodeTaste focuses on Java projects with strong test suites; results may differ for dynamically typed languages or poorly tested codebases.
- Static checks coverage – The pattern library captures common smells but cannot represent every architectural decision (e.g., design‑pattern migrations).
- Model size vs. cost – The best results come from proprietary models (GPT‑4, Claude‑2); replicating them with open‑source LLMs remains expensive.
- Human intent inference – The biggest open challenge is teaching models to infer why a developer wants a change, not just what to change. Future work could blend LLMs with program synthesis or reinforcement learning from human feedback to close this gap.
Authors
- Alex Thillen
- Niels Mündler
- Veselin Raychev
- Martin Vechev
Paper Information
- arXiv ID: 2603.04177v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: March 4, 2026