[Paper] Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
Source: arXiv - 2605.06125v1
Overview
The paper introduces TEBench, the first benchmark that evaluates coding agents on project‑level test evolution. Instead of giving agents a pre‑selected method to fix, TEBench asks them to scan an entire repository, spot which tests break, become stale, or are missing after a code change, and then generate the appropriate test patches. This shift mirrors real‑world development cycles where engineers must keep test suites in sync with evolving production code.
Key Contributions
- Project‑level benchmark: 314 curated task instances from 10 Defects4J projects, each paired with a real developer commit and ground‑truth test changes.
- Three evolution categories:
  - Test‑Breaking – existing tests that fail after the change.
  - Test‑Stale – tests that still pass but no longer validate the intended behavior.
  - Test‑Missing – entirely new tests required for newly introduced functionality.
- End‑to‑end evaluation pipeline: agents must (a) locate affected tests, (b) decide where new tests are needed, and (c) produce a syntactically correct test patch.
- Comprehensive empirical study: seven configurations spanning three industrial LLM‑based coding agents (Claude Code, Codex CLI, OpenCode) and six underlying models, plus a heuristic baseline.
- Open resources: benchmark code, data, and a public leaderboard for continuous community contributions.
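As a rough illustration of the categories and pipeline stages listed above, the following sketch models the three evolution types and the shape of an agent's answer. The names `EvolutionType`, `AgentSubmission`, and `touched_tests`, as well as the file paths, are hypothetical and chosen for this example; the summary does not specify TEBench's actual submission format.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvolutionType(Enum):
    """The three test-evolution categories named in the benchmark."""
    BREAKING = "test-breaking"  # existing test fails after the change
    STALE = "test-stale"        # test still passes but no longer validates intent
    MISSING = "test-missing"    # new test needed for new functionality


@dataclass
class AgentSubmission:
    """Hypothetical container for an agent's answer to one task instance."""
    tests_to_modify: list[str] = field(default_factory=list)  # existing test files to edit or delete
    tests_to_create: list[str] = field(default_factory=list)  # new test files to add
    patch: str = ""                                           # test patch text, e.g. a unified diff

    def touched_tests(self) -> set[str]:
        """All test files the submission claims need attention."""
        return set(self.tests_to_modify) | set(self.tests_to_create)


# Example with made-up file paths.
submission = AgentSubmission(
    tests_to_modify=["src/test/java/FooTest.java"],
    tests_to_create=["src/test/java/BarTest.java"],
)
print(sorted(submission.touched_tests()))
```

Keeping modified and newly created tests in separate lists mirrors the distinction between locating affected tests and deciding where new ones are needed.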
Methodology
- Data construction – Starting from Defects4J, the authors extracted commits that modify production code and have associated test changes authored by developers. A four‑stage pipeline filtered, de‑duplicated, and annotated these commits, yielding the final 314 instances.
- Annotation of evolution types – Each instance was labeled as containing one or more of the three test‑evolution categories based on diff analysis and manual verification.
- Agent task definition – For a given repository snapshot and a target commit, an agent must output:
  - A list of existing test files to modify (or delete).
  - A list of new test files to create.
  - The concrete code patches for those tests.
- Evaluation metrics:
  - Identification F1: the harmonic mean of precision and recall for identifying breaking, stale, and missing tests.
  - Patch executability: whether the generated test code compiles and runs.
  - Semantic similarity: how closely the generated patch matches the developer’s ground‑truth patch (used for analysis, not as a primary score).
- Baseline & agents – A simple heuristic that flags any test that fails on the changed code serves as a lower bound. The seven LLM configurations are run with identical prompts and temperature settings to ensure a fair comparison.
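To make the identification metric concrete, here is a minimal sketch of set-based precision, recall, and F1 over test identifiers. It uses the standard definitions; how the paper aggregates scores across instances and categories is not stated in this summary, so treat the function name `identification_f1` and the per-instance framing as assumptions.

```python
def identification_f1(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Standard precision/recall/F1 over two sets of test identifiers."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: the agent flags three tests, two of which are correct,
# while the developer's commit actually touched four tests.
predicted = {"ATest", "BTest", "CTest"}
ground_truth = {"ATest", "BTest", "DTest", "ETest"}
scores = identification_f1(predicted, ground_truth)
print(scores)
```

In this example precision is 2/3, recall is 1/2, and F1 is 4/7 (about 0.57), which shows how missing half of the affected tests pulls the score down even when most flagged tests are correct.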
Results & Findings
| Metric | Value across configurations |
|---|---|
| Identification F1 (overall) | 45.7 % – 49.4 % |
| Test‑Breaking F1 | ≈ 55 % (highest among the three) |
| Test‑Stale F1 | ≈ 36 % (most difficult) |
| Test‑Missing F1 | ≈ 48 % |
| Executable patch rate | > 90 % for breaking tests, < 30 % for stale/missing |
Key takeaways
- All configurations land within a narrow band of overall F1 (45.7 % to 49.4 %), suggesting that the limiting factor is the task formulation rather than model size.
- The “execute‑fail‑fix” loop dominates: agents first run the test suite, detect failures, and then generate fixes. This works for breaking tests but provides no signal for stale or missing tests, explaining the low F1 on those categories.
- Generated patches are often syntactically correct but diverge heavily from the developer’s ground truth, suggesting that agents treat the symptom (making the test pass) rather than reproduce the intent of the original test.
- The heuristic baseline, while simple, already captures a large portion of breaking tests, underscoring that failure signals are the primary driver for current agents.
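The blind spot of a purely failure-driven strategy can be shown with a small sketch. The function below mimics the heuristic baseline described earlier (flag every test that fails on the changed code); the test names and the ground-truth split are invented for illustration and do not come from the benchmark.

```python
def failure_driven_flags(test_results: dict[str, bool]) -> set[str]:
    """Flag every test that fails on the changed code (True means the test passed)."""
    return {name for name, passed in test_results.items() if not passed}


# Hypothetical outcome of running the existing suite after a code change.
results_after_change = {"ParserTest": False, "FormatterTest": True}

# Hypothetical ground truth: one test per evolution category.
ground_truth = {
    "breaking": {"ParserTest"},    # fails, so execution reveals it
    "stale": {"FormatterTest"},    # still passes, so execution is silent
    "missing": {"CacheTest"},      # does not exist yet, so it never runs
}

flagged = failure_driven_flags(results_after_change)
for category, tests in ground_truth.items():
    found = sorted(tests & flagged)
    missed = sorted(tests - flagged)
    print(f"{category}: found={found} missed={missed}")
```

Only the breaking test is found. The stale test passes and the missing test is never executed, so neither produces a failure for the loop to react to.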
Practical Implications
- Tooling for CI/CD – Integrating a TEBench‑compatible agent could automatically surface breaking tests after a pull request, reducing manual triage time.
- Test maintenance assistants – The benchmark highlights the need for agents that can reason about test relevance (staleness) and suggest new tests, a capability that would directly improve regression‑testing efficiency in large codebases.
- Model‑driven code review – Developers could receive early suggestions for test patches, especially for obvious breaking changes, accelerating the feedback loop.
- Benchmark‑driven product roadmaps – Companies building AI‑powered developer assistants now have a concrete, project‑scale metric to track progress beyond method‑level code generation.
- Open‑source contributions – Since TEBench is publicly available with a leaderboard, teams can benchmark their proprietary agents against the community baseline, fostering competitive improvement.
Limitations & Future Work
- Dataset size & diversity – Although 314 instances span ten projects, the benchmark covers a single language (Java, since all projects come from Defects4J) and a limited range of project structures.
- Reliance on execution failures – Current agents are biased toward detecting only breaking tests; richer semantic signals (e.g., change‑impact analysis, specification mining) are needed to handle stale and missing tests.
- Ground‑truth focus on developer patches – The evaluation treats the exact developer test as the gold standard, which may penalize valid alternative test designs produced by agents.
- Scalability to massive monorepos – The benchmark does not test agents on repositories with thousands of tests, where search and ranking become critical.
- Future directions suggested by the authors include: expanding to other ecosystems (Python, JavaScript), incorporating static analysis to surface stale tests without execution failures, and designing multi‑step prompting strategies that combine impact analysis with test generation.
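One way to picture the static-analysis direction mentioned above is a name-based change-impact check. This is a deliberately simplistic sketch, not the authors' method: it assumes a precomputed mapping from each test to the production methods it references, and all identifiers (`stale_candidates`, `untested_changes`, the method and test names) are hypothetical.

```python
def stale_candidates(
    changed_methods: set[str],
    test_references: dict[str, set[str]],
    test_results: dict[str, bool],
) -> set[str]:
    """Passing tests that reference a changed method: worth reviewing for staleness."""
    return {
        test
        for test, referenced in test_references.items()
        if test_results.get(test, False) and referenced & changed_methods
    }


def untested_changes(
    changed_methods: set[str],
    test_references: dict[str, set[str]],
) -> set[str]:
    """Changed methods that no existing test references: hints at missing tests."""
    referenced_anywhere = set().union(*test_references.values())
    return changed_methods - referenced_anywhere


# Made-up inputs for illustration.
changed = {"Parser.parse", "Cache.evict"}
references = {
    "ParserTest": {"Parser.parse"},
    "FormatterTest": {"Formatter.format"},
}
results = {"ParserTest": True, "FormatterTest": True}

print(stale_candidates(changed, references, results))
print(untested_changes(changed, references))
```

Here `ParserTest` passes yet exercises a changed method, so it is surfaced as a stale candidate, and `Cache.evict` has no referencing test, which points to a possible missing test. A real tool would need a proper call graph and semantic checks to avoid flooding developers with false positives.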
Authors
- Ye Shang
- Quanjun Zhang
- Haichuan Hu
- Chunrong Fang
- Liang Xiao
- Zhenyu Chen
Paper Information
- arXiv ID: 2605.06125v1
- Categories: cs.SE
- Published: May 7, 2026