[Paper] From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level
Source: arXiv - 2601.03731v1
Overview
The paper introduces RepoReason, a new benchmark that pushes large‑language‑model (LLM) agents beyond isolated code snippets and forces them to reason over entire, real‑world software repositories. By turning the execution environment into a “semantic oracle,” the authors can generate fresh, non‑memorized test cases that still capture the deep logical inter‑dependencies developers face every day.
Key Contributions
- RepoReason benchmark – a white‑box, repository‑level diagnostic suite focused on abductive assertion verification (i.e., “given a failing test, what code change explains it?”).
- Execution‑driven mutation framework – automatically mutates real projects, runs them, and uses the observed runtime state to synthesize ground‑truth “what should have happened” facts, eliminating memorization shortcuts.
- Fine‑grained diagnostic metrics – three orthogonal measures derived from dynamic program slicing (a toy computation is sketched after this list):
  - ESV (Execution‑State Volume) – how much of the codebase the agent must read to reconstruct the relevant state.
  - MCL (Mutation‑Cause Length) – depth of the logical chain the agent must simulate to locate the bug’s root cause.
  - DFI (Dependency‑Fusion Index) – breadth of cross‑file integration the agent must handle simultaneously.
- Comprehensive evaluation of state‑of‑the‑art agents (Claude‑4.5‑Sonnet, DeepSeek‑v3.1‑Terminus, etc.), revealing a systematic “aggregation deficit” where DFI is the dominant bottleneck.
- Open‑source tooling for reproducing the benchmark and extending it to new repositories, encouraging community‑driven progress.
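To make the three metrics concrete, here is a minimal sketch of how they might be computed from a dynamic slice. The SliceStatement structure, the toy slice, and the exact formulas are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliceStatement:
    """One statement in a dynamic slice (hypothetical representation)."""
    file: str   # file the statement lives in
    line: int   # line number within that file
    depth: int  # position in the causal chain from mutation to failing assertion

def execution_state_volume(slice_stmts):
    """ESV (assumed definition): number of distinct statements the agent must read."""
    return len({(s.file, s.line) for s in slice_stmts})

def mutation_cause_length(slice_stmts):
    """MCL (assumed definition): length of the longest causal chain in the slice."""
    return max((s.depth for s in slice_stmts), default=0)

def dependency_fusion_index(slice_stmts):
    """DFI (assumed definition): number of distinct files the slice spans."""
    return len({s.file for s in slice_stmts})

if __name__ == "__main__":
    # Toy slice: a mutation in utils.py propagates through parser.py into main.py.
    toy_slice = [
        SliceStatement("utils.py", 10, depth=1),
        SliceStatement("utils.py", 12, depth=2),
        SliceStatement("parser.py", 44, depth=3),
        SliceStatement("main.py", 7, depth=4),
    ]
    print("ESV:", execution_state_volume(toy_slice))   # 4 statements to read
    print("MCL:", mutation_cause_length(toy_slice))    # causal chain of length 4
    print("DFI:", dependency_fusion_index(toy_slice))  # 3 files fused together
```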
Methodology
- Dataset Construction – The authors start from a curated set of open‑source repositories (e.g., popular Python and JavaScript projects).
- Mutation & Execution – For each repo, they apply targeted source‑code mutations (e.g., off‑by‑one errors, missing imports). The mutated program is executed; the runtime environment (variable values, stack traces) serves as a semantic oracle that records the “correct” state before the bug manifests.
- Assertion Generation – Using the oracle, they automatically generate logical assertions that should hold (e.g., “after process_data, len(result) == expected”). The mutated code deliberately violates these assertions.
- Dynamic Program Slicing – When an assertion fails, the framework slices the execution trace to identify the minimal set of statements influencing the failure. This slice feeds the three metrics (ESV, MCL, DFI).
- Agent Evaluation – LLM agents are prompted to explain the failure and propose a fix. Their responses are scored against the ground‑truth slice, yielding a white‑box view of where the agent succeeded or fell short (a toy scoring proxy appears further below).
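As a rough illustration of the mutation-and-oracle idea (not the authors' actual tooling), the sketch below injects a bug into a tiny function, runs the original version to record the correct runtime state, and derives an assertion from that observation which the mutant then violates. The function names and the injected bug are invented for the example.

```python
# Minimal sketch of execution-driven mutation: run the original code to obtain a
# semantic oracle, derive an assertion from it, and check the mutant against it.

def process_data(items):
    """Original implementation: keep the strictly positive values."""
    return [x for x in items if x > 0]

def process_data_mutated(items):
    """Mutant with an injected off-by-one-style bug (>= instead of >)."""
    return [x for x in items if x >= 0]

def build_oracle(func, test_input):
    """Execute the unmutated code and record the observed runtime state."""
    return {"input": test_input, "expected": func(test_input)}

def assertion_holds(func, oracle):
    """Assertion synthesized from the oracle, e.g. len(result) == expected length."""
    result = func(oracle["input"])
    return len(result) == len(oracle["expected"])

if __name__ == "__main__":
    oracle = build_oracle(process_data, [3, 0, -1, 5])
    print("original satisfies assertion:", assertion_holds(process_data, oracle))         # True
    print("mutant satisfies assertion:  ", assertion_holds(process_data_mutated, oracle))  # False
```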
The whole pipeline is fully automated, allowing large‑scale benchmarking without manual labeling.
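The agent-evaluation step can be pictured as comparing the statements an agent cites in its explanation against the ground-truth slice. The precision/recall proxy below is an assumption for illustration; RepoReason's actual scoring rubric may differ.

```python
def score_agent_explanation(agent_statements, ground_truth_slice):
    """Compare the (file, line) pairs an agent references with the ground-truth slice.

    Both arguments are sets of (file, line) tuples. Returns precision, recall, and F1.
    This is an illustrative proxy, not the benchmark's exact rubric.
    """
    if not agent_statements or not ground_truth_slice:
        return 0.0, 0.0, 0.0
    hits = agent_statements & ground_truth_slice
    precision = len(hits) / len(agent_statements)
    recall = len(hits) / len(ground_truth_slice)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    ground_truth = {("utils.py", 12), ("parser.py", 44), ("main.py", 7)}
    agent_answer = {("utils.py", 12), ("parser.py", 44), ("config.py", 3)}
    print(score_agent_explanation(agent_answer, ground_truth))  # roughly (0.67, 0.67, 0.67)
```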
Results & Findings
- Performance Gap – Even the strongest agents achieve only ~38% on the composite RepoReason score, far below the human baseline (~85%).
- Aggregation Deficit – DFI (integration width) is the most predictive of failure; agents struggle when the bug’s cause spans many files or modules.
- Reading Load vs. Simulation Depth – Agents handle high ESV reasonably well (they can “read” many lines), but MCL (deep logical chaining) remains a secondary challenge.
- Model Size vs. Reasoning Ability – Scaling up model parameters yields diminishing returns on DFI; architectural changes (e.g., explicit memory or graph‑based reasoning) appear more promising.
- Cross‑language Consistency – Performance patterns hold across Python, JavaScript, and Go repos, suggesting the bottleneck is architectural rather than language‑specific.
Practical Implications
- Tooling for DevOps – RepoReason can be integrated into CI pipelines to stress‑test AI‑powered code reviewers or automated refactoring bots before they are deployed on production codebases.
- Guidance for Model Designers – The three metrics act as a diagnostic “report card,” helping engineers pinpoint whether to invest in better retrieval mechanisms (to lower DFI) or deeper reasoning modules (to lower MCL).
- Improved Bug‑Localization Assistants – By exposing the aggregation deficit, the paper motivates hybrid systems that combine LLM reasoning with graph‑based dependency analysis, potentially yielding more reliable automated debugging assistants (a rough dependency‑graph sketch follows this list).
- Benchmark‑Driven Procurement – Companies evaluating LLM‑based coding assistants can use RepoReason as a realistic, repository‑level test rather than synthetic snippet challenges, leading to more trustworthy procurement decisions.
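To give a flavor of what graph-based dependency analysis could contribute in such a hybrid system, the sketch below builds a coarse import graph over a Python repository with the standard ast module so that files related to a failing module can be surfaced to an LLM first. This is one possible design, not anything described in the paper.

```python
import ast
import os

def build_import_graph(repo_root):
    """Map each Python file in repo_root to the set of top-level module names it imports."""
    graph = {}
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    tree = ast.parse(fh.read())
            except (SyntaxError, UnicodeDecodeError):
                continue  # skip files that do not parse cleanly
            imports = set()
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(alias.name.split(".")[0] for alias in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    imports.add(node.module.split(".")[0])
            graph[os.path.relpath(path, repo_root)] = imports
    return graph

def related_files(graph, failing_module):
    """Files that import the failing module: candidates to show the LLM first."""
    return sorted(f for f, deps in graph.items() if failing_module in deps)

if __name__ == "__main__":
    graph = build_import_graph(".")       # point at any local repository
    print(related_files(graph, "utils"))  # e.g. files that depend on a 'utils' module
```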
Limitations & Future Work
- Scope of Mutations – The current mutation set focuses on classic logical bugs; more exotic defects (e.g., performance regressions, security vulnerabilities) remain untested.
- Static vs. Dynamic Languages – While Python and JavaScript are well covered, languages with heavy compile‑time semantics (e.g., Rust, C++) may require additional slicing techniques.
- Human Baseline – The paper reports a single human baseline; broader user studies would better quantify the gap between developers and LLM agents.
- Scalability of Slicing – Dynamic slicing on very large repositories can be computationally expensive; future work could explore approximate slicing or static‑analysis proxies.
- Agent Interaction Model – Evaluations assume a single turn of reasoning; multi‑turn, interactive debugging sessions could reveal different strengths and weaknesses.
RepoReason opens a concrete path toward evaluating—and ultimately improving—LLM agents that need to think like real software engineers, handling the tangled, multi‑file reality of production code.
Authors
- Jia Li
- Yuxin Su
- Michael R. Lyu
Paper Information
- arXiv ID: 2601.03731v1
- Categories: cs.SE, cs.AI
- Published: January 7, 2026