[Paper] RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing
Source: arXiv - 2602.22518v1
Overview
The paper introduces RepoMod‑Bench, a new benchmark that measures how well AI‑driven coding agents can modernize entire code repositories. By using implementation‑agnostic (black‑box) tests that are hidden from the agents, the authors provide a deterministic, language‑independent way to evaluate functional equivalence between the original and the modernized code—something that prior benchmarks have struggled to achieve.
Key Contributions
- Large‑scale, multi‑language benchmark: 21 real‑world repositories (14 K–211 K LOC) spanning 8 programming languages, totaling 1.6 M LOC and 11,616 tests.
- Implementation‑agnostic evaluation: Tests are executed as black‑box binaries, preventing agents from “cheating” by reading or tailoring to the test suite.
- Standardized interfaces: Each repo is wrapped with a uniform API, enabling cross‑language functional checks.
- Empirical baseline: Four state‑of‑the‑art AI coding agents are evaluated, revealing a dramatic drop in pass rates as repo size grows (91 % → 15 %).
- Open‑source release: Benchmark data, test harness, and evaluation scripts are publicly available on GitHub.
Methodology
- Repository selection – Real‑world open‑source projects were chosen for diversity (different domains, languages, and sizes) and for having a clear “source” implementation that can serve as ground truth.
- Interface standardization – Each repo is wrapped with a thin shim exposing a small set of language‑neutral functions (e.g., parse(), serialize()), so the same test harness can call into any language implementation.
- Implementation‑agnostic test suite – Tests are compiled into language‑specific binaries that exercise the standardized interface and compare outputs against the original repository's behavior. The test binaries are never exposed to the AI agents.
- Agent configurations – Four leading code‑generation agents (e.g., Codex‑based, GPT‑4‑based, and two open‑source models) are prompted to translate or refactor each repository without seeing the tests.
- Scoring – A pass‑rate is computed per repository: the fraction of hidden tests that the modernized code satisfies. Results are aggregated across size buckets (<10 K LOC, 10–50 K LOC, >50 K LOC).
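The harness-and-scoring pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual harness: the JSON-over-stdin protocol, the stand-in "binaries", and the `run_impl`/`pass_rate` helpers are all assumptions chosen to make the black-box comparison concrete.

```python
import json
import subprocess
import sys

def run_impl(cmd, payload):
    """Invoke an implementation as a black box: feed one JSON test case
    on stdin, read its JSON response from stdout (assumed protocol)."""
    proc = subprocess.run(cmd, input=json.dumps(payload),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

def pass_rate(reference_cmd, candidate_cmd, test_cases):
    """Fraction of hidden test cases on which the modernized (candidate)
    implementation matches the original (reference) behavior."""
    passed = sum(
        run_impl(reference_cmd, case) == run_impl(candidate_cmd, case)
        for case in test_cases
    )
    return passed / len(test_cases)

# Stand-in "binaries" so the sketch runs anywhere: two tiny programs
# that compute the same function in different ways.
REFERENCE = [sys.executable, "-c",
    "import sys,json; d=json.load(sys.stdin); print(json.dumps({'n': d['n'] * 2}))"]
CANDIDATE = [sys.executable, "-c",
    "import sys,json; d=json.load(sys.stdin); print(json.dumps({'n': d['n'] + d['n']}))"]

cases = [{"n": i} for i in range(10)]
print(f"pass rate: {pass_rate(REFERENCE, CANDIDATE, cases):.1%}")
```

Because the comparison runs only at the process boundary, the candidate can be written in any language; the harness never inspects its source, which is what makes the evaluation implementation-agnostic.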
Results & Findings
| Size bucket | Avg. pass rate (baseline agents) |
|---|---|
| < 10 K LOC | 91.3 % |
| 10–50 K LOC | 38.7 % |
| > 50 K LOC | 15.3 % |
- Scaling collapse: Performance degrades sharply as repository size grows, indicating current agents struggle with large‑scale architectural reasoning.
- Language robustness: Pass rates are consistently low across all eight languages, suggesting the bottleneck is not language‑specific but rather the ability to manage complex codebases.
- Test‑hiding effectiveness: Since agents never see the tests, the low scores cannot be blamed on overfitting to a visible test suite; they reflect genuine functional gaps.
Practical Implications
- Tooling developers: If you’re building AI‑assisted refactoring or migration tools, RepoMod‑Bench offers a realistic yardstick that mimics production constraints (no test visibility, multi‑language support).
- CI/CD integration: The black‑box test harness can be dropped into existing pipelines to automatically validate AI‑generated patches before they reach production.
- Enterprise migration: Companies looking to modernize legacy monoliths should temper expectations—current models perform well only on small, isolated components.
- Model training: The benchmark highlights the need for training data that captures architectural patterns (module boundaries, API contracts) rather than just snippet‑level completions.
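As a sketch of the CI/CD point above, a pipeline step could gate AI-generated patches on the black-box pass rate. The results format, the `gate` helper, and the threshold are hypothetical; a real pipeline would parse the harness's own report.

```python
def gate(results, threshold=1.0):
    """Return a CI exit status: 0 if the black-box pass rate meets the
    threshold, 1 otherwise. `results` maps hidden-test names to booleans
    (an assumed format for this sketch)."""
    rate = sum(results.values()) / len(results)
    print(f"black-box pass rate: {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

# Simulated harness output for the sketch.
results = {"parse_roundtrip": True, "serialize_unicode": True,
           "parse_empty_input": True}
status = gate(results)  # a non-zero status would fail the CI step
```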
Limitations & Future Work
- Scope of modernization: The benchmark focuses on functional equivalence; non‑functional aspects (performance, memory usage, security) are not measured.
- Static test coverage: While hidden, the test suites are still manually authored and may miss edge cases present in real deployments.
- Agent diversity: Only four configurations were evaluated; broader coverage (e.g., fine‑tuned domain‑specific models) could yield different scaling behavior.
- Future directions: extending the benchmark to cover performance regressions, adding larger codebases (>500 K LOC), and exploring semi‑automated test generation to broaden evaluation coverage.
Authors
- Xuefeng Li
- Nir Ben-Israel
- Yotam Raz
- Belal Ahmed
- Doron Serebro
- Antoine Raux
Paper Information
- arXiv ID: 2602.22518v1
- Categories: cs.SE
- Published: February 26, 2026