[Paper] JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)

Published: February 10, 2026
4 min read
Source: arXiv - 2602.09930v1

Overview

The paper introduces JMigBench, a new benchmark designed to measure how well large language models (LLMs) can help developers migrate Java code from version 8 to version 11. By focusing on real‑world API deprecations, the authors provide a concrete way to evaluate whether LLMs can actually reduce the manual effort involved in large‑scale code upgrades.

Key Contributions

  • A curated migration dataset: 8 categories of deprecated Java 8 APIs (e.g., java.time, CORBA, JAX‑WS) with paired “before‑and‑after” function snippets drawn from open‑source projects.
  • Benchmark framework: Standardized evaluation pipeline using CodeBLEU and a keyword‑based correctness metric to capture lexical, syntactic, and semantic fidelity of migrated code.
  • Empirical study on Mistral Codestral: First‑hand performance numbers for a state‑of‑the‑art LLM on Java migration tasks, highlighting strengths (simple one‑to‑one replacements) and weaknesses (complex API refactorings).
  • Open‑source release: All data, evaluation scripts, and analysis notebooks are publicly available, encouraging reproducibility and future extensions.
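To make the "before-and-after" pairing concrete, here is a hypothetical snippet in the spirit of the dataset's simple API-swap category, migrating legacy `java.util.Date`/`Calendar` usage to `java.time`. The class and method names are invented for this sketch and do not come from the benchmark itself:

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Calendar;
import java.util.Date;

public class MigrationPair {
    // "Before": Java 8 style using the legacy Calendar API.
    static int yearBefore(Date date) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(date);
        return cal.get(Calendar.YEAR);
    }

    // "After": the same logic expressed with the java.time API.
    static int yearAfter(Date date) {
        LocalDate local = date.toInstant()
                              .atZone(ZoneId.systemDefault())
                              .toLocalDate();
        return local.getYear();
    }

    public static void main(String[] args) {
        Date now = new Date();
        // A faithful migration must preserve behavior:
        System.out.println(yearBefore(now) == yearAfter(now)); // prints true
    }
}
```

A pair like this gives the evaluation pipeline both a reference target (the "after" version) and a behavioral baseline to compare against.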

Methodology

  1. Data collection – The authors mined thousands of Java repositories on GitHub, extracting function pairs where a commit upgraded code from Java 8 to Java 11.
  2. Quality filtering – Automatic heuristics (e.g., compile‑time checks, diff size limits) were applied, then manual inspection narrowed the set to high‑confidence examples across eight deprecation categories.
  3. Prompt design – Each Java 8 function was fed to the LLM with a concise instruction (“Migrate this method to Java 11”). No additional context (e.g., surrounding class) was provided, mimicking a realistic “copy‑paste” usage scenario.
  4. Evaluation metrics
    • CodeBLEU: measures token‑level similarity while rewarding correct syntax and structure.
    • Keyword‑based correctness: checks whether required new APIs (e.g., java.time.LocalDate) appear and deprecated ones disappear.
    • Exact match rate: percentage of migrations that are identical to the reference implementation.
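The keyword-based correctness metric can be read as a simple presence/absence test over API names. Here is a minimal sketch of that idea, assuming the metric checks for fully qualified API names as substrings; the class and method names are illustrative, not the paper's actual implementation:

```java
import java.util.List;

public class KeywordCheck {
    // A migration "passes" if every required new API name appears in the
    // model output and no deprecated API name survives in it.
    static boolean keywordCorrect(String migrated,
                                  List<String> requiredApis,
                                  List<String> deprecatedApis) {
        for (String api : requiredApis) {
            if (!migrated.contains(api)) return false;
        }
        for (String api : deprecatedApis) {
            if (migrated.contains(api)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String output = "java.time.LocalDate date = java.time.LocalDate.now();";
        System.out.println(keywordCorrect(output,
                List.of("java.time.LocalDate"),
                List.of("java.util.Date"))); // prints true
    }
}
```

Note that such a check captures lexical adoption of the new API but says nothing about behavior, which is exactly why the authors pair it with CodeBLEU and exact-match rates.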

Results & Findings

| Category | Exact match (identical) | CodeBLEU (avg.) | Keyword-correct |
| --- | --- | --- | --- |
| Simple API swaps (e.g., Date → LocalDate) | 11.1 % | 0.68 | 0.73 |
| CORBA / JAX‑WS (complex refactor) | < 2 % | 0.42 | 0.48 |
| Miscellaneous (e.g., Stream updates) | 5 % | 0.55 | 0.60 |
  • Trivial replacements: The model reliably swaps deprecated classes for their modern equivalents when the change is a direct one‑to‑one mapping.
  • Complex migrations: When the upgrade requires architectural changes (e.g., moving from CORBA to REST‑style services), the model often produces syntactically valid but semantically incorrect code.
  • Overall: Mistral Codestral can partially automate repetitive migration steps, but human review remains essential for anything beyond straightforward API swaps.

Practical Implications

  • Developer tooling: IDE plugins could integrate an LLM‑based “quick‑fix” that automatically rewrites simple deprecated calls, shaving minutes off large upgrade tickets.
  • CI/CD pipelines: Automated migration scripts powered by LLMs can generate a first draft of updated code, which is then validated by static analysis tools before merging.
  • Cost‑benefit: For enterprises with large Java 8 codebases, the results suggest a 10‑15 % reduction in manual migration effort for low‑complexity APIs, which translates into measurable time savings.
  • Training data for custom models: The curated dataset can serve as fine‑tuning material for domain‑specific LLMs, potentially boosting performance on the harder categories.

Limitations & Future Work

  • Dataset scope: JMigBench covers only eight deprecation categories and function‑level snippets; larger class‑ or module‑level migrations are not represented.
  • Prompt simplicity: Real‑world developers often provide richer context (imports, surrounding code). More sophisticated prompting could improve results.
  • Model diversity: The study evaluates a single LLM (Mistral Codestral). Extending the benchmark to other models (e.g., GPT‑4, Claude) will clarify whether the observed gaps are model‑specific or inherent to current LLM capabilities.
  • Semantic verification: The current metrics focus on lexical similarity; integrating runtime tests or type‑checking could give a more accurate picture of functional correctness.

Bottom line: JMigBench gives the community a concrete yardstick for measuring LLM‑driven code migration. While early results are promising for simple API updates, there’s still a long road before LLMs can fully replace human expertise in complex Java upgrades. The benchmark itself is a valuable resource for anyone building the next generation of AI‑assisted developer tools.

Authors

  • Nishil Amin
  • Zhiwei Fei
  • Xiang Li
  • Justyna Petke
  • He Ye

Paper Information

  • arXiv ID: 2602.09930v1
  • Categories: cs.SE
  • Published: February 10, 2026