[Paper] RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing
Source: arXiv - 2602.22518v1
Overview
The paper introduces RepoMod‑Bench, a new benchmark that measures how well AI‑driven coding agents can modernize entire code repositories. By using implementation‑agnostic (black‑box) tests that are hidden from the agents, the authors provide a deterministic, language‑independent way to evaluate functional equivalence between the original and the modernized code—something that prior benchmarks have struggled to achieve.
Key Contributions
- Large‑scale, multi‑language benchmark: 21 real‑world repositories (14 K–211 K LOC) spanning 8 programming languages, totaling 1.6 M LOC and 11,616 tests.
- Implementation‑agnostic evaluation: Tests are executed as black‑box binaries, preventing agents from “cheating” by reading or tailoring to the test suite.
- Standardized interfaces: Each repo is wrapped with a uniform API, enabling cross‑language functional checks.
- Empirical baseline: Four state‑of‑the‑art AI coding agents are evaluated, revealing a dramatic drop in pass rates as repo size grows (91 % → 15 %).
- Open‑source release: Benchmark data, test harness, and evaluation scripts are publicly available on GitHub.
Methodology
- Repository selection – Real‑world open‑source projects were chosen for diversity (different domains, languages, and sizes) and for having a clear “source” implementation that can serve as ground truth.
- Interface standardization – Each repo is wrapped with a thin shim exposing a small set of language‑neutral functions (e.g., parse(), serialize()), so the same test harness can call into any language implementation.
- Implementation‑agnostic test suite – Tests are compiled into language‑specific binaries that exercise the standardized interface and compare outputs against the original repository's behavior. The test binaries are never exposed to the AI agents.
- Agent configurations – Four leading code‑generation agents (e.g., Codex‑based, GPT‑4‑based, and two open‑source models) are prompted to translate or refactor each repository without seeing the tests.
- Scoring – A pass‑rate is computed per repository: the fraction of hidden tests that the modernized code satisfies. Results are aggregated across size buckets (<10 K LOC, 10–50 K LOC, >50 K LOC).
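The harness-and-scoring pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual harness: the JSON-over-stdin protocol, the stand-in "binaries", and the `run_impl`/`pass_rate` helpers are all assumptions chosen to make the black-box comparison concrete.

```python
import json
import subprocess
import sys

def run_impl(cmd, payload):
    """Invoke an implementation as a black box: feed one JSON test case
    on stdin, read its JSON response from stdout (assumed protocol)."""
    proc = subprocess.run(cmd, input=json.dumps(payload),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

def pass_rate(reference_cmd, candidate_cmd, test_cases):
    """Fraction of hidden test cases on which the modernized (candidate)
    implementation matches the original (reference) behavior."""
    passed = sum(
        run_impl(reference_cmd, case) == run_impl(candidate_cmd, case)
        for case in test_cases
    )
    return passed / len(test_cases)

# Stand-in "binaries" so the sketch runs anywhere: two tiny programs
# that compute the same function in different ways.
REFERENCE = [sys.executable, "-c",
    "import sys,json; d=json.load(sys.stdin); print(json.dumps({'n': d['n'] * 2}))"]
CANDIDATE = [sys.executable, "-c",
    "import sys,json; d=json.load(sys.stdin); print(json.dumps({'n': d['n'] + d['n']}))"]

cases = [{"n": i} for i in range(10)]
print(f"pass rate: {pass_rate(REFERENCE, CANDIDATE, cases):.1%}")
```

Because the comparison runs only at the process boundary, the candidate can be written in any language; the harness never inspects its source, which is what makes the evaluation implementation-agnostic.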
Results & Findings
| Size bucket | Avg. pass rate (baseline agents) |
|---|---|
| < 10 K LOC | 91.3 % |
| 10–50 K LOC | 38.7 % |
| > 50 K LOC | 15.3 % |
- Scaling collapse: Performance degrades sharply as repository size grows, indicating current agents struggle with large‑scale architectural reasoning.
- Language robustness: Pass rates are consistently low across all eight languages, suggesting the bottleneck is not language‑specific but rather the ability to manage complex codebases.
- Test‑hiding effectiveness: Since agents never see the tests, the low scores cannot be blamed on overfitting to a visible test suite; they reflect genuine functional gaps.
Practical Implications
- Tooling developers: If you’re building AI‑assisted refactoring or migration tools, RepoMod‑Bench offers a realistic yardstick that mimics production constraints (no test visibility, multi‑language support).
- CI/CD integration: The black‑box test harness can be dropped into existing pipelines to automatically validate AI‑generated patches before they reach production.
- Enterprise migration: Companies looking to modernize legacy monoliths should temper expectations—current models perform well only on small, isolated components.
- Model training: The benchmark highlights the need for training data that captures architectural patterns (module boundaries, API contracts) rather than just snippet‑level completions.
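As a sketch of the CI/CD point above, a pipeline step could gate AI-generated patches on the black-box pass rate. The results format, the `gate` helper, and the threshold are hypothetical; a real pipeline would parse the harness's own report.

```python
def gate(results, threshold=1.0):
    """Return a CI exit status: 0 if the black-box pass rate meets the
    threshold, 1 otherwise. `results` maps hidden-test names to booleans
    (an assumed format for this sketch)."""
    rate = sum(results.values()) / len(results)
    print(f"black-box pass rate: {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

# Simulated harness output for the sketch.
results = {"parse_roundtrip": True, "serialize_unicode": True,
           "parse_empty_input": True}
status = gate(results)  # a non-zero status would fail the CI step
```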
Limitations & Future Work
- Scope of modernization: The benchmark focuses on functional equivalence; non‑functional aspects (performance, memory usage, security) are not measured.
- Static test coverage: While hidden, the test suites are still manually authored and may miss edge cases present in real deployments.
- Agent diversity: Only four configurations were evaluated; broader coverage (e.g., fine‑tuned domain‑specific models) could yield different scaling behavior.
- Future directions: extending the benchmark to cover performance regressions, adding larger codebases (>500 K LOC), and exploring semi‑automated test generation to broaden evaluation coverage.
Authors
- Xuefeng Li
- Nir Ben-Israel
- Yotam Raz
- Belal Ahmed
- Doron Serebro
- Antoine Raux
Paper Information
- arXiv ID: 2602.22518v1
- Categories: cs.SE
- Published: February 26, 2026