[Paper] JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)
Source: arXiv - 2602.09930v1
Overview
The paper introduces JMigBench, a new benchmark designed to measure how well large language models (LLMs) can help developers migrate Java code from version 8 to version 11. By focusing on real‑world API deprecations, the authors provide a concrete way to evaluate whether LLMs can actually reduce the manual effort involved in large‑scale code upgrades.
Key Contributions
- A curated migration dataset: 8 categories of deprecated Java 8 APIs (e.g., java.time, CORBA, JAX‑WS) with paired “before‑and‑after” function snippets drawn from open‑source projects.
- Benchmark framework: A standardized evaluation pipeline using CodeBLEU and a keyword‑based correctness metric to capture lexical, syntactic, and semantic fidelity of migrated code.
- Empirical study on Mistral Codestral: First‑hand performance numbers for a state‑of‑the‑art LLM on Java migration tasks, highlighting strengths (simple one‑to‑one replacements) and weaknesses (complex API refactorings).
- Open‑source release: All data, evaluation scripts, and analysis notebooks are publicly available, encouraging reproducibility and future extensions.
Methodology
- Data collection – The authors mined thousands of Java repositories on GitHub, extracting function pairs where a commit upgraded code from Java 8 to Java 11.
- Quality filtering – Automatic heuristics (e.g., compile‑time checks, diff size limits) were applied, then manual inspection narrowed the set to high‑confidence examples across eight deprecation categories.
- Prompt design – Each Java 8 function was fed to the LLM with a concise instruction (“Migrate this method to Java 11”). No additional context (e.g., surrounding class) was provided, mimicking a realistic “copy‑paste” usage scenario.
- Evaluation metrics –
- CodeBLEU: measures token‑level similarity while rewarding correct syntax and structure.
- Keyword‑based correctness: checks whether the required new APIs (e.g., java.time.LocalDate) appear and the deprecated ones disappear.
- Exact match rate: percentage of migrations that are identical to the reference implementation.
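The keyword‑based correctness metric can be sketched as a simple containment check. This is a hypothetical illustration (the class and method names below are not from the paper), assuming the metric passes a migration only if every required Java 11 API name appears in the output and every deprecated Java 8 API name is absent:

```java
import java.util.List;

// Hypothetical sketch of a keyword-based correctness check, as described in
// the summary above. The required/deprecated keyword lists are illustrative.
public class KeywordCheck {
    public static boolean isKeywordCorrect(String migrated,
                                           List<String> required,
                                           List<String> deprecated) {
        // All required new API names must appear in the migrated code...
        boolean hasAllNew = required.stream().allMatch(migrated::contains);
        // ...and none of the deprecated API names may remain.
        boolean noOldLeft = deprecated.stream().noneMatch(migrated::contains);
        return hasAllNew && noOldLeft;
    }

    public static void main(String[] args) {
        String migrated = "LocalDate today = LocalDate.now();";
        System.out.println(isKeywordCorrect(migrated,
                List.of("LocalDate"),
                List.of("java.util.Date")));
    }
}
```

A purely lexical check like this is cheap but coarse, which is why the paper pairs it with CodeBLEU and an exact‑match rate.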
Results & Findings
| Category | Exact‑match (identical) | CodeBLEU (avg.) | Keyword‑correct |
|---|---|---|---|
| Simple API swaps (e.g., Date → LocalDate) | 11.1 % | 0.68 | 0.73 |
| CORBA / JAX‑WS (complex refactor) | < 2 % | 0.42 | 0.48 |
| Miscellaneous (e.g., Stream updates) | 5 % | 0.55 | 0.60 |
- Trivial replacements: The model reliably swaps deprecated classes for their modern equivalents when the change is a direct one‑to‑one mapping.
- Complex migrations: When the upgrade requires architectural changes (e.g., moving from CORBA to REST‑style services), the model often produces syntactically valid but semantically incorrect code.
- Overall: Mistral Codestral can partially automate repetitive migration steps, but human review remains essential for anything beyond straightforward API swaps.
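To make the “simple API swap” category concrete, here is an illustrative before‑and‑after pair of the kind the benchmark reports the model handling well. This example is not drawn from the dataset; it shows a direct one‑to‑one replacement of the legacy java.util.Date API with java.time:

```java
import java.time.LocalDate;
import java.util.Date;

// Illustrative example of a trivial one-to-one migration (not from JMigBench).
public class SwapExample {
    // Java 8 style: legacy java.util.Date with manual millisecond arithmetic.
    static Date tomorrowLegacy() {
        return new Date(System.currentTimeMillis() + 24L * 60 * 60 * 1000);
    }

    // Java 11 style: the same intent expressed with java.time.LocalDate.
    static LocalDate tomorrowModern() {
        return LocalDate.now().plusDays(1);
    }

    public static void main(String[] args) {
        System.out.println(tomorrowModern());
    }
}
```

Swaps like this map one deprecated call site to one modern call site, which is exactly the pattern where the model scores highest.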
Practical Implications
- Developer tooling: IDE plugins could integrate an LLM‑based “quick‑fix” that automatically rewrites simple deprecated calls, shaving minutes off large upgrade tickets.
- CI/CD pipelines: Automated migration scripts powered by LLMs can generate a first draft of updated code, which is then validated by static analysis tools before merging.
- Cost‑benefit: For enterprises with massive Java 8 codebases, the benchmark suggests a 10‑15 % reduction in manual migration effort for low‑complexity APIs—translating to measurable time savings.
- Training data for custom models: The curated dataset can serve as fine‑tuning material for domain‑specific LLMs, potentially boosting performance on the harder categories.
Limitations & Future Work
- Dataset scope: JMigBench covers only eight deprecation categories and function‑level snippets; larger class‑ or module‑level migrations are not represented.
- Prompt simplicity: Real‑world developers often provide richer context (imports, surrounding code). More sophisticated prompting could improve results.
- Model diversity: The study evaluates a single LLM (Mistral Codestral). Extending the benchmark to other models (e.g., GPT‑4, Claude) will clarify whether the observed gaps are model‑specific or inherent to current LLM capabilities.
- Semantic verification: The current metrics focus on lexical similarity; integrating runtime tests or type‑checking could give a more accurate picture of functional correctness.
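The semantic‑verification idea above can be sketched as a runtime equivalence check: execute the original and migrated functions on the same inputs and compare outputs, instead of relying on lexical similarity. The functions below are hypothetical stand‑ins for a before/after pair, not part of the benchmark:

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical sketch of a runtime equivalence check for a migrated function
// pair: both versions should agree on every tested input.
public class EquivalenceCheck {
    // "Before": Java 8 style using java.util.Date.
    static long epochMillisLegacy(Date d) {
        return d.getTime();
    }

    // "After": Java 11 style using java.time.Instant.
    static long epochMillisModern(Instant i) {
        return i.toEpochMilli();
    }

    // Compare both versions on one input value.
    static boolean behaviorallyEquivalent(long millis) {
        return epochMillisLegacy(new Date(millis))
                == epochMillisModern(Instant.ofEpochMilli(millis));
    }

    public static void main(String[] args) {
        System.out.println(behaviorallyEquivalent(1_700_000_000_000L));
    }
}
```

In practice such checks would be driven by generated or project‑supplied test inputs, catching the “syntactically valid but semantically incorrect” failures that CodeBLEU misses.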
Bottom line: JMigBench gives the community a concrete yardstick for measuring LLM‑driven code migration. While early results are promising for simple API updates, there’s still a long road before LLMs can fully replace human expertise in complex Java upgrades. The benchmark itself is a valuable resource for anyone building the next generation of AI‑assisted developer tools.
Authors
- Nishil Amin
- Zhiwei Fei
- Xiang Li
- Justyna Petke
- He Ye
Paper Information
- arXiv ID: 2602.09930v1
- Categories: cs.SE
- Published: February 10, 2026
- PDF: Download PDF