[Paper] RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing

Published: February 25, 2026 at 08:25 PM EST
4 min read
Source: arXiv


Overview

The paper introduces RepoMod‑Bench, a new benchmark that measures how well AI‑driven coding agents can modernize entire code repositories. By using implementation‑agnostic (black‑box) tests that are hidden from the agents, the authors provide a deterministic, language‑independent way to evaluate functional equivalence between the original and the modernized code—something that prior benchmarks have struggled to achieve.

Key Contributions

  • Large‑scale, multi‑language benchmark: 21 real‑world repositories (14 K–211 K LOC) spanning 8 programming languages, totaling 1.6 M LOC and 11,616 tests.
  • Implementation‑agnostic evaluation: Tests are executed as black‑box binaries, preventing agents from “cheating” by reading or tailoring to the test suite.
  • Standardized interfaces: Each repo is wrapped with a uniform API, enabling cross‑language functional checks.
  • Empirical baseline: Four state‑of‑the‑art AI coding agents are evaluated, revealing a dramatic drop in pass rates as repo size grows (91 % → 15 %).
  • Open‑source release: Benchmark data, test harness, and evaluation scripts are publicly available on GitHub.

Methodology

  1. Repository selection – Real‑world open‑source projects were chosen for diversity (different domains, languages, and sizes) and for having a clear “source” implementation that can serve as ground truth.
  2. Interface standardization – Each repo is wrapped with a thin shim exposing a set of functions (e.g., parse(), serialize()) that are language‑neutral. This lets the same test harness call into any language implementation.
  3. Implementation‑agnostic test suite – Tests are compiled into language‑specific binaries that run the standardized interface and compare outputs against the original repository’s behavior. The test binaries are never exposed to the AI agents.
  4. Agent configurations – Four leading code‑generation agents (Codex‑based, GPT‑4‑based, and two open‑source models) are prompted to translate or refactor each repository without seeing the tests.
  5. Scoring – A pass‑rate is computed per repository: the fraction of hidden tests that the modernized code satisfies. Results are aggregated across size buckets (<10 K LOC, 10–50 K LOC, >50 K LOC).
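The harness loop described in steps 2–5 could be sketched roughly as follows. This is a minimal illustration, not the paper's actual tooling: the binary naming scheme (`test_*`), the `--repo` flag, and the directory layout are all assumptions.

```python
import subprocess
from pathlib import Path

# Hypothetical sketch of the black-box scoring loop: each hidden test is a
# compiled binary that exercises the repository's standardized interface
# (e.g. its parse()/serialize() shim) and exits 0 on functional equivalence.
def run_hidden_tests(test_dir: Path, repo_root: Path) -> float:
    """Return the fraction of hidden test binaries the modernized repo passes."""
    binaries = sorted(test_dir.glob("test_*"))
    if not binaries:
        return 0.0
    passed = 0
    for binary in binaries:
        # The agent never sees these binaries; it only observes the shim API.
        result = subprocess.run(
            [str(binary), "--repo", str(repo_root)],
            capture_output=True,
            timeout=300,
        )
        if result.returncode == 0:
            passed += 1
    return passed / len(binaries)
```

Because the tests communicate only through exit codes and the standardized interface, the same loop works unchanged whether the modernized repository is Python, Go, or any of the other supported languages.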

Results & Findings

Size bucket      Avg. pass rate (baseline agents)
< 10 K LOC       91.3 %
10–50 K LOC      38.7 %
> 50 K LOC       15.3 %
  • Scaling collapse: Performance degrades sharply as repository size grows, indicating current agents struggle with large‑scale architectural reasoning.
  • Language robustness: Pass rates are consistently low across all eight languages, suggesting the bottleneck is not language‑specific but rather the ability to manage complex codebases.
  • Test‑hiding effectiveness: Since agents never see the tests, the low scores cannot be blamed on overfitting to a visible test suite; they reflect genuine functional gaps.

Practical Implications

  • Tooling developers: If you’re building AI‑assisted refactoring or migration tools, RepoMod‑Bench offers a realistic yardstick that mimics production constraints (no test visibility, multi‑language support).
  • CI/CD integration: The black‑box test harness can be dropped into existing pipelines to automatically validate AI‑generated patches before they reach production.
  • Enterprise migration: Companies looking to modernize legacy monoliths should temper expectations—current models perform well only on small, isolated components.
  • Model training: The benchmark highlights the need for training data that captures architectural patterns (module boundaries, API contracts) rather than just snippet‑level completions.
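As an illustration of the CI/CD point above, a gating step could consume the harness's pass rate and block substandard AI‑generated patches. The threshold value and the function name `gate` are hypothetical choices for this sketch, not anything the paper prescribes.

```python
import sys

# Hypothetical CI gate: block a patch when the hidden-test pass rate falls
# below a project-defined threshold. The pass-rate value is assumed to come
# from the benchmark's black-box harness earlier in the pipeline.
def gate(pass_rate: float, threshold: float = 0.95) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    if pass_rate < threshold:
        print(f"BLOCKED: pass rate {pass_rate:.1%} is below {threshold:.0%}")
        return 1
    print(f"OK: pass rate {pass_rate:.1%}")
    return 0

if __name__ == "__main__":
    # e.g. `python gate.py 0.87` in a CI step after the harness runs.
    sys.exit(gate(float(sys.argv[1]) if len(sys.argv) > 1 else 0.0))
```

Exit-code gating keeps the integration surface small: any CI system that fails a job on a nonzero exit status can use it without further plumbing.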

Limitations & Future Work

  • Scope of modernization: The benchmark focuses on functional equivalence; non‑functional aspects (performance, memory usage, security) are not measured.
  • Static test coverage: While hidden, the test suites are still manually authored and may miss edge cases present in real deployments.
  • Agent diversity: Only four configurations were evaluated; broader coverage (e.g., fine‑tuned domain‑specific models) could yield different scaling behavior.
  • Future directions: Extending the benchmark to cover performance regressions, adding larger codebases (>500 K LOC), and exploring semi‑automated test generation to broaden evaluation criteria.

Authors

  • Xuefeng Li
  • Nir Ben-Israel
  • Yotam Raz
  • Belal Ahmed
  • Doron Serebro
  • Antoine Raux

Paper Information

  • arXiv ID: 2602.22518v1
  • Categories: cs.SE
  • Published: February 26, 2026
