[Paper] SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring

Published: February 3, 2026 at 11:36 AM EST
4 min read
Source: arXiv


Overview

The paper introduces SWE‑Refactor, a new benchmark that captures real‑world, repository‑level refactoring tasks performed by developers on Java codebases. By providing 1,099 carefully vetted refactoring instances, the authors give the community a realistic test‑bed for evaluating how well large language models (LLMs) can suggest semantics‑preserving code edits—a capability that goes far beyond simple code generation.

Key Contributions

  • Large, realistic dataset: 1,099 developer‑authored refactorings mined from 18 open‑source Java projects, split into 922 atomic (single‑step) and 177 compound (multi‑step) edits.
  • Rigorous validation pipeline: Each instance is checked for compile‑time correctness, passes the project’s test suite, and is verified with automated refactoring detection tools to guarantee behavior preservation.
  • Repository‑level context: Benchmarks include surrounding files and build configurations, forcing LLMs to reason about imports, project structure, and cross‑file dependencies.
  • Comprehensive evaluation: Nine popular LLMs (e.g., GPT‑4o‑mini, DeepSeek‑V3, Code Llama, OpenAI Codex) are benchmarked, establishing baseline success rates for both atomic and compound refactorings.
  • Open release: The full dataset, evaluation scripts, and results are publicly released to accelerate research on LLM‑driven code maintenance.
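The atomic/compound split can be made concrete with a small illustration. The benchmark itself is Java, but the distinction carries over; the class and method names below are hypothetical, not drawn from the dataset:

```python
# Atomic refactoring: one self-contained transformation, e.g. renaming a method.
#   Before: def calc(self): ...
#   After:  def calculate_total(self): ...
# Only the declaration and its call sites change in a single step.

# Compound refactoring: multiple coordinated changes, e.g. "extract method"
# plus updating every caller to go through the new helper.
class Order:
    def __init__(self, items):
        self.items = items  # list of (name, price) pairs

    def total(self):
        # After the compound edit, the tax logic that used to be inlined
        # here lives in its own extracted helper.
        subtotal = sum(price for _, price in self.items)
        return subtotal + self._tax(subtotal)

    def _tax(self, subtotal, rate=0.08):
        # Extracted helper (hypothetical); every former inline computation
        # must now route through this method for the edit to be complete.
        return subtotal * rate
```

A model attempting the compound version must get both halves right: the extraction itself and the rewiring of all call sites.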

Methodology

  1. Mining Refactorings – The authors mined commit histories of 18 mature Java repositories, extracting commits that contain only refactoring changes (no functional edits).
  2. Filtering & Classification – Refactorings were classified as atomic (single transformation, e.g., rename method) or compound (multiple coordinated changes, e.g., extract class + update call sites).
  3. Validation Suite – For each candidate:
    • The project is rebuilt to ensure the change compiles.
    • All existing unit/integration tests are run to confirm no behavior regression.
    • A static refactoring detection tool (e.g., RefactoringMiner) cross‑checks that the recorded transformation matches the actual code diff.
  4. Prompt Design – Each benchmark instance is turned into a prompt that supplies the relevant file(s), build configuration, and a natural‑language description of the desired refactoring.
  5. LLM Evaluation – The nine LLMs generate patches, which are then automatically applied, compiled, and tested. Success is recorded when the generated patch exactly reproduces the validated refactoring and passes all tests.
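The validation gates in steps 3 and 5 can be sketched as a short pipeline. This is a hedged sketch, not the authors' harness: the three callables stand in for the project build (e.g., `mvn compile`), the existing test suite, and a detector such as RefactoringMiner cross-checking the diff.

```python
from typing import Callable, Tuple

def validate_patch(
    compiles: Callable[[], bool],         # stand-in for the project build
    tests_pass: Callable[[], bool],       # stand-in for unit/integration tests
    detector_agrees: Callable[[], bool],  # stand-in for refactoring detection
) -> Tuple[bool, str]:
    """Run the three checks in order, stopping at the first failure."""
    if not compiles():
        return False, "compile-error"
    if not tests_pass():
        return False, "test-regression"
    if not detector_agrees():
        return False, "refactoring-mismatch"
    return True, "validated"
```

A generated patch counts as a success only when all three gates pass, mirroring the paper's compile-plus-tests-plus-detector criterion.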

Results & Findings

Refactoring type   Best‑performing model (success %)   Worst‑performing model (success %)
Atomic             GPT‑4o‑mini – 68.2 %                DeepSeek‑V3 – 41.5 %
Compound           GPT‑4o‑mini – 42.9 %                OpenAI Codex agent – 39.4 %
  • Atomic edits are relatively tractable; top models solve roughly two‑thirds of cases.
  • Compound edits remain challenging: even the best model fails on more than half of them.
  • Failure analysis shows that errors cluster around dependency tracking (missing imports or stale references) and multi‑step coordination (e.g., forgetting to update all call sites after an extraction).
  • The gap between models narrows for simple renames but widens dramatically for structural changes like “extract interface” or “move method to another class.”
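The dependency-tracking failures above (a rename that leaves a stale call site behind) can be caught even by a crude cross-reference scan. A minimal sketch, with hypothetical file contents rather than benchmark data:

```python
def stale_references(files: dict, old_name: str) -> list:
    """Return the paths of files that still mention a renamed symbol."""
    return [path for path, text in files.items() if old_name in text]

# After renaming calc() -> calculate_total(), one caller was left unchanged:
repo = {
    "Order.java": "public double calculate_total() { /* ... */ }",
    "Invoice.java": "order.calculate_total();",
    "Report.java": "order.calc();",  # stale call site the model forgot
}

print(stale_references(repo, "calc()"))  # → ['Report.java']
```

Real tooling would parse the code rather than match substrings, but even this level of checking flags the multi-step coordination errors the failure analysis describes.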

Practical Implications

  • Tooling developers can use SWE‑Refactor as a regression suite when building LLM‑powered IDE assistants, ensuring that new model releases improve on real‑world refactoring tasks.
  • DevOps pipelines could eventually integrate LLM suggestions for routine clean‑ups (e.g., renaming, extracting methods) but should treat complex, multi‑file refactorings as “human‑in‑the‑loop” operations until model reliability improves.
  • Training data curators now have a concrete, high‑quality source of behavior‑preserving edit examples, which can be used to fine‑tune or instruction‑tune LLMs for maintenance‑oriented tasks.
  • Project maintainers can benchmark their own code‑review bots against the published baseline, identifying where their custom prompts or post‑processing steps add value.
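Used as a regression suite, the benchmark reduces to comparing per-category success rates across model releases. A minimal sketch (the tuple layout is illustrative, not the released evaluation format):

```python
from collections import defaultdict

def success_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += passed  # bool counts as 0 or 1
    return {c: passes[c] / totals[c] for c in totals}

runs = [("atomic", True), ("atomic", False), ("compound", False), ("compound", False)]
print(success_rates(runs))  # {'atomic': 0.5, 'compound': 0.0}
```

A maintainer would run the same instance set against each new model or prompt variant and diff the resulting rates.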

Limitations & Future Work

  • Language scope – The benchmark currently covers only Java; extending to other ecosystems (Python, JavaScript, Rust) would broaden applicability.
  • Scale of compound edits – Only 177 compound instances are available; a larger, more diverse set could better capture the full spectrum of real‑world refactorings.
  • Human evaluation – Success is measured automatically (compilation + tests). Some subtle semantic changes that escape test suites may go unnoticed. Incorporating expert reviewer judgments could tighten the evaluation.
  • Prompt engineering – The study uses a single prompt format; exploring alternative prompt designs (e.g., chain‑of‑thought, few‑shot examples) might boost performance, especially for compound tasks.

The authors have released SWE‑Refactor and all evaluation artifacts, inviting the community to build on this foundation and push LLMs closer to being trustworthy code‑maintenance partners.

Authors

  • Yisen Xu
  • Jinqiu Yang
  • Tse‑Hsun Chen

Paper Information

  • arXiv ID: 2602.03712v1
  • Categories: cs.SE
  • Published: February 3, 2026