[Paper] JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)
Source: arXiv - 2602.09930v1
Overview
The paper introduces JMigBench, a new benchmark designed to measure how well large language models (LLMs) can help developers migrate Java code from version 8 to version 11. By focusing on real‑world API deprecations, the authors provide a concrete way to evaluate whether LLMs can actually reduce the manual effort involved in large‑scale code upgrades.
Key Contributions
- A curated migration dataset: 8 categories of deprecated Java 8 APIs (e.g., java.time, CORBA, JAX‑WS) with paired “before‑and‑after” function snippets drawn from open‑source projects.
- Benchmark framework: A standardized evaluation pipeline using CodeBLEU and a keyword‑based correctness metric to capture lexical, syntactic, and semantic fidelity of migrated code.
- Empirical study on Mistral Codestral: First‑hand performance numbers for a state‑of‑the‑art LLM on Java migration tasks, highlighting strengths (simple one‑to‑one replacements) and weaknesses (complex API refactorings).
- Open‑source release: All data, evaluation scripts, and analysis notebooks are publicly available, encouraging reproducibility and future extensions.
Methodology
- Data collection – The authors mined thousands of Java repositories on GitHub, extracting function pairs where a commit upgraded code from Java 8 to Java 11.
- Quality filtering – Automatic heuristics (e.g., compile‑time checks, diff size limits) were applied, then manual inspection narrowed the set to high‑confidence examples across eight deprecation categories.
- Prompt design – Each Java 8 function was fed to the LLM with a concise instruction (“Migrate this method to Java 11”). No additional context (e.g., surrounding class) was provided, mimicking a realistic “copy‑paste” usage scenario.
- Evaluation metrics –
- CodeBLEU: measures token‑level similarity while rewarding correct syntax and structure.
- Keyword‑based correctness: checks whether the required new APIs (e.g., java.time.LocalDate) appear and the deprecated ones disappear.
- Exact match rate: percentage of migrations that are identical to the reference implementation.
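The keyword‑based correctness metric can be sketched as a simple containment check. This is a hypothetical illustration (the class and method names below are not from the paper), assuming the metric passes a migration only if every required Java 11 API name appears in the output and every deprecated Java 8 API name is absent:

```java
import java.util.List;

// Hypothetical sketch of a keyword-based correctness check, as described in
// the summary above. The required/deprecated keyword lists are illustrative.
public class KeywordCheck {
    public static boolean isKeywordCorrect(String migrated,
                                           List<String> required,
                                           List<String> deprecated) {
        // All required new API names must appear in the migrated code...
        boolean hasAllNew = required.stream().allMatch(migrated::contains);
        // ...and none of the deprecated API names may remain.
        boolean noOldLeft = deprecated.stream().noneMatch(migrated::contains);
        return hasAllNew && noOldLeft;
    }

    public static void main(String[] args) {
        String migrated = "LocalDate today = LocalDate.now();";
        System.out.println(isKeywordCorrect(migrated,
                List.of("LocalDate"),
                List.of("java.util.Date")));
    }
}
```

A purely lexical check like this is cheap but coarse, which is why the paper pairs it with CodeBLEU and an exact‑match rate.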
Results & Findings
| Category | Exact‑match (identical) | CodeBLEU (avg.) | Keyword‑correct |
|---|---|---|---|
| Simple API swaps (e.g., Date → LocalDate) | 11.1 % | 0.68 | 0.73 |
| CORBA / JAX‑WS (complex refactor) | < 2 % | 0.42 | 0.48 |
| Miscellaneous (e.g., Stream updates) | 5 % | 0.55 | 0.60 |
- Trivial replacements: The model reliably swaps deprecated classes for their modern equivalents when the change is a direct one‑to‑one mapping.
- Complex migrations: When the upgrade requires architectural changes (e.g., moving from CORBA to REST‑style services), the model often produces syntactically valid but semantically incorrect code.
- Overall: Mistral Codestral can partially automate repetitive migration steps, but human review remains essential for anything beyond straightforward API swaps.
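To make the “simple API swap” category concrete, here is an illustrative before‑and‑after pair of the kind the benchmark reports the model handling well. This example is not drawn from the dataset; it shows a direct one‑to‑one replacement of the legacy java.util.Date API with java.time:

```java
import java.time.LocalDate;
import java.util.Date;

// Illustrative example of a trivial one-to-one migration (not from JMigBench).
public class SwapExample {
    // Java 8 style: legacy java.util.Date with manual millisecond arithmetic.
    static Date tomorrowLegacy() {
        return new Date(System.currentTimeMillis() + 24L * 60 * 60 * 1000);
    }

    // Java 11 style: the same intent expressed with java.time.LocalDate.
    static LocalDate tomorrowModern() {
        return LocalDate.now().plusDays(1);
    }

    public static void main(String[] args) {
        System.out.println(tomorrowModern());
    }
}
```

Swaps like this map one deprecated call site to one modern call site, which is exactly the pattern where the model scores highest.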
Practical Implications
- Developer tooling: IDE plugins could integrate an LLM‑based “quick‑fix” that automatically rewrites simple deprecated calls, shaving minutes off large upgrade tickets.
- CI/CD pipelines: Automated migration scripts powered by LLMs can generate a first draft of updated code, which is then validated by static analysis tools before merging.
- Cost‑benefit: For enterprises with massive Java 8 codebases, the benchmark suggests a 10‑15 % reduction in manual migration effort for low‑complexity APIs—translating to measurable time savings.
- Training data for custom models: The curated dataset can serve as fine‑tuning material for domain‑specific LLMs, potentially boosting performance on the harder categories.
Limitations & Future Work
- Dataset scope: JMigBench covers only eight deprecation categories and function‑level snippets; larger class‑ or module‑level migrations are not represented.
- Prompt simplicity: Real‑world developers often provide richer context (imports, surrounding code). More sophisticated prompting could improve results.
- Model diversity: The study evaluates a single LLM (Mistral Codestral). Extending the benchmark to other models (e.g., GPT‑4, Claude) will clarify whether the observed gaps are model‑specific or inherent to current LLM capabilities.
- Semantic verification: The current metrics focus on lexical similarity; integrating runtime tests or type‑checking could give a more accurate picture of functional correctness.
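The semantic‑verification idea above can be sketched as a runtime equivalence check: execute the original and migrated functions on the same inputs and compare outputs, instead of relying on lexical similarity. The functions below are hypothetical stand‑ins for a before/after pair, not part of the benchmark:

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical sketch of a runtime equivalence check for a migrated function
// pair: both versions should agree on every tested input.
public class EquivalenceCheck {
    // "Before": Java 8 style using java.util.Date.
    static long epochMillisLegacy(Date d) {
        return d.getTime();
    }

    // "After": Java 11 style using java.time.Instant.
    static long epochMillisModern(Instant i) {
        return i.toEpochMilli();
    }

    // Compare both versions on one input value.
    static boolean behaviorallyEquivalent(long millis) {
        return epochMillisLegacy(new Date(millis))
                == epochMillisModern(Instant.ofEpochMilli(millis));
    }

    public static void main(String[] args) {
        System.out.println(behaviorallyEquivalent(1_700_000_000_000L));
    }
}
```

In practice such checks would be driven by generated or project‑supplied test inputs, catching the “syntactically valid but semantically incorrect” failures that CodeBLEU misses.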
Bottom line: JMigBench gives the community a concrete yardstick for measuring LLM‑driven code migration. While early results are promising for simple API updates, there’s still a long road before LLMs can fully replace human expertise in complex Java upgrades. The benchmark itself is a valuable resource for anyone building the next generation of AI‑assisted developer tools.
Authors
- Nishil Amin
- Zhiwei Fei
- Xiang Li
- Justyna Petke
- He Ye
Paper Information
- arXiv ID: 2602.09930v1
- Categories: cs.SE
- Published: February 10, 2026
- PDF: Download PDF