[Paper] From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level
Source: arXiv - 2601.03731v1
Overview
The paper introduces RepoReason, a new benchmark that pushes large‑language‑model (LLM) agents beyond isolated code snippets and forces them to reason over entire, real‑world software repositories. By turning the execution environment into a “semantic oracle,” the authors can generate fresh, non‑memorized test cases that still capture the deep logical inter‑dependencies developers face every day.
Key Contributions
- RepoReason benchmark – a white‑box, repository‑level diagnostic suite focused on abductive assertion verification (i.e., “given a failing test, what code change explains it?”).
- Execution‑driven mutation framework – automatically mutates real projects, runs them, and uses the observed runtime state to synthesize ground‑truth “what should have happened” facts, eliminating memorization shortcuts.
- Fine‑grained diagnostic metrics – three orthogonal measures derived from dynamic program slicing (a toy computation is sketched after this list):
  - ESV (Execution‑State Volume) – how much of the codebase the agent must read to reconstruct the relevant state.
  - MCL (Mutation‑Cause Length) – depth of the logical chain the agent must simulate to locate the bug’s root cause.
  - DFI (Dependency‑Fusion Index) – breadth of cross‑file integration the agent must handle simultaneously.
- Comprehensive evaluation of state‑of‑the‑art agents (Claude‑4.5‑Sonnet, DeepSeek‑v3.1‑Terminus, etc.), revealing a systematic “aggregation deficit” where DFI is the dominant bottleneck.
- Open‑source tooling for reproducing the benchmark and extending it to new repositories, encouraging community‑driven progress.
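To make the three metrics concrete, here is a minimal sketch of how they might be computed from a dynamic slice. The SliceStatement structure, the toy slice, and the exact formulas are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliceStatement:
    """One statement in a dynamic slice (hypothetical representation)."""
    file: str   # file the statement lives in
    line: int   # line number within that file
    depth: int  # position in the causal chain from mutation to failing assertion

def execution_state_volume(slice_stmts):
    """ESV (assumed definition): number of distinct statements the agent must read."""
    return len({(s.file, s.line) for s in slice_stmts})

def mutation_cause_length(slice_stmts):
    """MCL (assumed definition): length of the longest causal chain in the slice."""
    return max((s.depth for s in slice_stmts), default=0)

def dependency_fusion_index(slice_stmts):
    """DFI (assumed definition): number of distinct files the slice spans."""
    return len({s.file for s in slice_stmts})

if __name__ == "__main__":
    # Toy slice: a mutation in utils.py propagates through parser.py into main.py.
    toy_slice = [
        SliceStatement("utils.py", 10, depth=1),
        SliceStatement("utils.py", 12, depth=2),
        SliceStatement("parser.py", 44, depth=3),
        SliceStatement("main.py", 7, depth=4),
    ]
    print("ESV:", execution_state_volume(toy_slice))   # 4 statements to read
    print("MCL:", mutation_cause_length(toy_slice))    # causal chain of length 4
    print("DFI:", dependency_fusion_index(toy_slice))  # 3 files fused together
```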
Methodology
- Dataset Construction – The authors start from a curated set of open‑source repositories (e.g., popular Python and JavaScript projects).
- Mutation & Execution – For each repo, they apply targeted source‑code mutations (e.g., off‑by‑one errors, missing imports). The mutated program is executed; the runtime environment (variable values, stack traces) serves as a semantic oracle that records the “correct” state before the bug manifests.
- Assertion Generation – Using the oracle, they automatically generate logical assertions that should hold (e.g., “after process_data, len(result) == expected”). The mutated code deliberately violates these assertions.
- Dynamic Program Slicing – When an assertion fails, the framework slices the execution trace to identify the minimal set of statements influencing the failure. This slice feeds the three metrics (ESV, MCL, DFI).
- Agent Evaluation – LLM agents are prompted to explain the failure and propose a fix. Their responses are scored against the ground‑truth slice, yielding a white‑box view of where the agent succeeded or fell short (a toy scoring proxy appears further below).
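As a rough illustration of the mutation-and-oracle idea (not the authors' actual tooling), the sketch below injects a bug into a tiny function, runs the original version to record the correct runtime state, and derives an assertion from that observation which the mutant then violates. The function names and the injected bug are invented for the example.

```python
# Minimal sketch of execution-driven mutation: run the original code to obtain a
# semantic oracle, derive an assertion from it, and check the mutant against it.

def process_data(items):
    """Original implementation: keep the strictly positive values."""
    return [x for x in items if x > 0]

def process_data_mutated(items):
    """Mutant with an injected off-by-one-style bug (>= instead of >)."""
    return [x for x in items if x >= 0]

def build_oracle(func, test_input):
    """Execute the unmutated code and record the observed runtime state."""
    return {"input": test_input, "expected": func(test_input)}

def assertion_holds(func, oracle):
    """Assertion synthesized from the oracle, e.g. len(result) == expected length."""
    result = func(oracle["input"])
    return len(result) == len(oracle["expected"])

if __name__ == "__main__":
    oracle = build_oracle(process_data, [3, 0, -1, 5])
    print("original satisfies assertion:", assertion_holds(process_data, oracle))         # True
    print("mutant satisfies assertion:  ", assertion_holds(process_data_mutated, oracle))  # False
```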
The whole pipeline is fully automated, allowing large‑scale benchmarking without manual labeling.
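The agent-evaluation step can be pictured as comparing the statements an agent cites in its explanation against the ground-truth slice. The precision/recall proxy below is an assumption for illustration; RepoReason's actual scoring rubric may differ.

```python
def score_agent_explanation(agent_statements, ground_truth_slice):
    """Compare the (file, line) pairs an agent references with the ground-truth slice.

    Both arguments are sets of (file, line) tuples. Returns precision, recall, and F1.
    This is an illustrative proxy, not the benchmark's exact rubric.
    """
    if not agent_statements or not ground_truth_slice:
        return 0.0, 0.0, 0.0
    hits = agent_statements & ground_truth_slice
    precision = len(hits) / len(agent_statements)
    recall = len(hits) / len(ground_truth_slice)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    ground_truth = {("utils.py", 12), ("parser.py", 44), ("main.py", 7)}
    agent_answer = {("utils.py", 12), ("parser.py", 44), ("config.py", 3)}
    print(score_agent_explanation(agent_answer, ground_truth))  # roughly (0.67, 0.67, 0.67)
```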
Results & Findings
- Performance Gap – Even the strongest agents achieve only ~38% on the composite RepoReason score, far below the human baseline (~85%).
- Aggregation Deficit – DFI (integration width) is the most predictive of failure; agents struggle when the bug’s cause spans many files or modules.
- Reading Load vs. Simulation Depth – Agents handle high ESV reasonably well (they can “read” many lines), but MCL (deep logical chaining) remains a secondary challenge.
- Model Size vs. Reasoning Ability – Scaling up model parameters yields diminishing returns on DFI; architectural changes (e.g., explicit memory or graph‑based reasoning) appear more promising.
- Cross‑language Consistency – Performance patterns hold across Python, JavaScript, and Go repos, suggesting the bottleneck is architectural rather than language‑specific.
Practical Implications
- Tooling for DevOps – RepoReason can be integrated into CI pipelines to stress‑test AI‑powered code reviewers or automated refactoring bots before they are deployed on production codebases.
- Guidance for Model Designers – The three metrics act as a diagnostic “report card,” helping engineers pinpoint whether to invest in better retrieval mechanisms (to lower DFI) or deeper reasoning modules (to lower MCL).
- Improved Bug‑Localization Assistants – By exposing the aggregation deficit, the paper motivates hybrid systems that combine LLM reasoning with graph‑based dependency analysis, potentially yielding more reliable automated debugging assistants (a rough dependency‑graph sketch follows this list).
- Benchmark‑Driven Procurement – Companies evaluating LLM‑based coding assistants can use RepoReason as a realistic, repository‑level test rather than synthetic snippet challenges, leading to more trustworthy procurement decisions.
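To give a flavor of what graph-based dependency analysis could contribute in such a hybrid system, the sketch below builds a coarse import graph over a Python repository with the standard ast module so that files related to a failing module can be surfaced to an LLM first. This is one possible design, not anything described in the paper.

```python
import ast
import os

def build_import_graph(repo_root):
    """Map each Python file in repo_root to the set of top-level module names it imports."""
    graph = {}
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    tree = ast.parse(fh.read())
            except (SyntaxError, UnicodeDecodeError):
                continue  # skip files that do not parse cleanly
            imports = set()
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(alias.name.split(".")[0] for alias in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    imports.add(node.module.split(".")[0])
            graph[os.path.relpath(path, repo_root)] = imports
    return graph

def related_files(graph, failing_module):
    """Files that import the failing module: candidates to show the LLM first."""
    return sorted(f for f, deps in graph.items() if failing_module in deps)

if __name__ == "__main__":
    graph = build_import_graph(".")       # point at any local repository
    print(related_files(graph, "utils"))  # e.g. files that depend on a 'utils' module
```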
Limitations & Future Work
- Scope of Mutations – The current mutation set focuses on classic logical bugs; more exotic defects (e.g., performance regressions, security vulnerabilities) remain untested.
- Static vs. Dynamic Languages – While Python and JavaScript are well covered, languages with heavy compile‑time semantics (e.g., Rust, C++) may require additional slicing techniques.
- Human Baseline – The paper reports a single human baseline; broader user studies would better quantify the gap between developers and LLM agents.
- Scalability of Slicing – Dynamic slicing on very large repositories can be computationally expensive; future work could explore approximate slicing or static‑analysis proxies.
- Agent Interaction Model – Evaluations assume a single turn of reasoning; multi‑turn, interactive debugging sessions could reveal different strengths and weaknesses.
RepoReason opens a concrete path toward evaluating—and ultimately improving—LLM agents that need to think like real software engineers, handling the tangled, multi‑file reality of production code.
Authors
- Jia Li
- Yuxin Su
- Michael R. Lyu
Paper Information
- arXiv ID: 2601.03731v1
- Categories: cs.SE, cs.AI
- Published: January 7, 2026