[Paper] CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
Source: arXiv - 2603.04177v1
Overview
The paper CodeTaste asks a simple yet powerful question: Can today’s large language model (LLM) coding assistants perform refactorings at the level of a human developer? By mining real‑world multi‑file refactoring commits from open‑source projects, the authors build a benchmark that lets us measure how reliably LLMs can clean up code, remove duplication, and improve architecture without breaking functionality.
Key Contributions
- CodeTaste benchmark – a curated dataset of 1,200+ real refactoring tasks extracted from popular GitHub repositories, each paired with the original test suite and a set of static “pattern” checks.
- Hybrid evaluation metric – combines passing the repository’s unit tests with automated static analyses that verify the removal of unwanted code patterns and the introduction of desired ones, using data‑flow reasoning.
- Empirical study of frontier LLMs – evaluates GPT‑4, Claude‑2, and Llama‑2‑70B on two modes: (1) detailed refactoring instructions, and (2) open‑ended improvement prompts.
- Propose‑then‑implement workflow – shows that letting the model first suggest a refactoring plan, then selecting the best‑aligned proposal before implementation, significantly narrows the performance gap.
- Insight into human‑LLM alignment – reveals that current models excel when the refactoring goal is explicitly spelled out, but struggle to infer the “right” human‑chosen transformation from vague improvement hints.
Methodology
- Data collection – The authors mined the GitHub History API for commits that (a) touched multiple files, (b) passed all tests before and after the change, and (c) were labeled with refactoring‑related keywords (e.g., “extract method”, “rename variable”).
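The filtering step above can be sketched as a simple predicate over mined commits. This is an illustrative sketch, not the authors' implementation: the keyword list and the commit dictionary shape are assumptions, and the real pipeline also verifies that the test suite passes before and after the change.

```python
# Illustrative commit filter (assumed structure, not the paper's exact code).
REFACTORING_KEYWORDS = ("extract method", "rename variable", "inline", "move class")

def is_refactoring_candidate(commit: dict) -> bool:
    """Keep commits that touch multiple files and mention a refactoring keyword."""
    message = commit["message"].lower()
    multi_file = len(commit["files"]) > 1
    keyword_hit = any(kw in message for kw in REFACTORING_KEYWORDS)
    return multi_file and keyword_hit

commit = {"message": "Refactor: extract method for parsing",
          "files": ["Parser.java", "ParserHelper.java"]}
print(is_refactoring_candidate(commit))  # True
```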
- Task formulation – Each commit becomes a task: the pre‑refactor code is the prompt, and the post‑refactor code is the ground truth. Two prompt styles are used:
  - Detailed – “Extract the duplicated logic in `foo()` into a helper method called `processBar`.”
  - Open‑ended – “Improve the design of `foo()`.”
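A task pair of this kind looks roughly like the following. This is a made-up Python example for illustration (the benchmark itself targets Java): the pre-refactor code duplicates a validation check, and the ground-truth post-refactor code extracts it into a helper.

```python
# Pre-refactor (the prompt): duplicated validation logic in two functions.
def create_user(name):
    if not name or not name.strip():
        raise ValueError("invalid name")
    return {"name": name.strip()}

def rename_user(user, name):
    if not name or not name.strip():
        raise ValueError("invalid name")
    user["name"] = name.strip()
    return user

# Post-refactor (the ground truth): the duplicated check extracted into a helper.
def _validate_name(name):
    if not name or not name.strip():
        raise ValueError("invalid name")
    return name.strip()

def create_user_v2(name):
    return {"name": _validate_name(name)}

def rename_user_v2(user, name):
    user["name"] = _validate_name(name)
    return user
```

The test suite must pass against both versions; the static checks additionally confirm that the duplication is gone.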
- LLM interaction – Models are queried via the OpenAI/Anthropic APIs with temperature = 0.2. For the propose‑then‑implement pipeline, the model first generates up to three refactoring proposals, a lightweight static matcher scores each proposal, and the highest‑scoring one is fed back for concrete code generation.
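The propose-then-implement loop can be sketched as below. The scoring function here is a hypothetical stand-in for the paper's "lightweight static matcher": it simply counts mentions of desired pattern keywords, whereas the real matcher uses static analysis.

```python
# Hedged sketch of proposal selection; score_proposal is a placeholder
# for the paper's lightweight static matcher.
def score_proposal(proposal: str, target_patterns: set) -> int:
    """Count how many desired pattern keywords the proposal mentions."""
    return sum(1 for p in target_patterns if p in proposal.lower())

def select_best(proposals: list, target_patterns: set) -> str:
    """Pick the proposal that best aligns with the desired patterns."""
    return max(proposals, key=lambda p: score_proposal(p, target_patterns))

proposals = [
    "Rename variable tmp to buffer",
    "Extract the duplicated branch into a helper method",
    "Add logging statements",
]
best = select_best(proposals, {"extract", "helper", "duplicated"})
print(best)  # the extract-method proposal
```

The selected proposal is then fed back to the model as an explicit instruction, effectively converting an open-ended task into a detailed one.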
- Evaluation – A solution passes if (i) the repository’s test suite still succeeds, and (ii) all static checks (e.g., “no duplicated `if` branches”, “no dead code”) are satisfied. The authors report pass rate and pattern precision (how many intended patterns were correctly introduced or removed).
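The two reported metrics can be expressed compactly. This is a minimal sketch under assumed data shapes (boolean test outcome, sets of pattern labels); the paper's exact precision definition may differ in detail.

```python
def task_passes(tests_pass: bool, static_checks: list) -> bool:
    """A solution passes iff the test suite succeeds AND every static check holds."""
    return tests_pass and all(static_checks)

def pattern_precision(introduced: set, intended: set) -> float:
    """Share of introduced/removed patterns that were actually intended."""
    if not introduced:
        return 0.0
    return len(introduced & intended) / len(introduced)

print(task_passes(True, [True, True]))                      # True
print(pattern_precision({"helper", "logging"}, {"helper"}))  # 0.5
```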
Results & Findings
| Model | Detailed Prompt Pass Rate | Open‑ended Prompt Pass Rate |
|---|---|---|
| GPT‑4 | 78 % | 42 % |
| Claude‑2 | 71 % | 38 % |
| Llama‑2‑70B | 55 % | 27 % |
- Detailed instructions → high success: When the refactoring goal is spelled out, even the largest open‑source model (Llama‑2‑70B) reaches >50 % pass rate.
- Open‑ended prompts → steep drop: All models struggle to infer the exact human‑chosen transformation, confirming a gap between “can refactor” and “can guess the right refactor”.
- Propose‑then‑implement gains: Adding a proposal selection step lifts GPT‑4’s open‑ended pass rate from 42 % to 58 %, and Claude‑2 from 38 % to 53 %.
- Pattern precision: When a model succeeds, it correctly introduces the desired patterns 92 % of the time, showing that failures are usually due to missed test coverage rather than malformed refactorings.
Practical Implications
- IDE assistants can safely automate well‑specified refactorings (e.g., “extract method”, “rename variable”) today, reducing boilerplate for developers.
- Code review bots could use a propose‑then‑implement loop to suggest multiple refactoring options, letting a human pick the most appropriate one—much like a “refactoring wizard” powered by LLMs.
- Continuous integration pipelines can incorporate CodeTaste‑style static checks to verify that AI‑generated patches don’t re‑introduce anti‑patterns, providing a safety net before merging.
- Enterprise tooling may start exposing a “refactor‑by‑prompt” API where developers give a high‑level goal (“reduce cyclomatic complexity in module X”) and the system returns a ranked list of concrete changes, accelerating legacy code modernization.
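A CodeTaste-style CI gate could be as simple as scanning the added lines of a patch for known anti-patterns. Everything here is an assumed illustration (the pattern set and diff format are placeholders), not tooling described in the paper.

```python
import re

# Hypothetical anti-pattern registry; real checks would use data-flow analysis.
ANTI_PATTERNS = {
    "bare except": re.compile(r"except\s*:"),
    "debug print": re.compile(r"\bprint\("),
}

def find_anti_patterns(patch: str) -> list:
    """Return names of anti-patterns found on the '+' (added) lines of a diff."""
    added = [line[1:] for line in patch.splitlines() if line.startswith("+")]
    return [name for name, rx in ANTI_PATTERNS.items()
            if any(rx.search(line) for line in added)]

patch = "+try:\n+    run()\n+except:\n+    pass\n"
print(find_anti_patterns(patch))  # ['bare except']
```

A pipeline would fail the merge when the returned list is non-empty, giving reviewers a safety net against AI-generated regressions.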
Limitations & Future Work
- Dataset bias – CodeTaste focuses on Java projects with strong test suites; results may differ for dynamically typed languages or poorly tested codebases.
- Static checks coverage – The pattern library captures common smells but cannot represent every architectural decision (e.g., design‑pattern migrations).
- Model size vs. cost – The best results come from proprietary models (GPT‑4, Claude‑2); replicating them with open‑source LLMs remains expensive.
- Human intent inference – The biggest open challenge is teaching models to infer why a developer wants a change, not just what to change. Future work could blend LLMs with program synthesis or reinforcement learning from human feedback to close this gap.
Authors
- Alex Thillen
- Niels Mündler
- Veselin Raychev
- Martin Vechev
Paper Information
- arXiv ID: 2603.04177v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: March 4, 2026