[Paper] From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

Published: January 7, 2026
Source: arXiv - 2601.03731v1

Overview

The paper introduces RepoReason, a new benchmark that pushes large‑language‑model (LLM) agents beyond isolated code snippets and forces them to reason over entire, real‑world software repositories. By turning the execution environment into a “semantic oracle,” the authors can generate fresh, non‑memorized test cases that still capture the deep logical inter‑dependencies developers face every day.

Key Contributions

  • RepoReason benchmark – a white‑box, repository‑level diagnostic suite focused on abductive assertion verification (i.e., “given a failing test, what code change explains it?”).
  • Execution‑driven mutation framework – automatically mutates real projects, runs them, and uses the observed runtime state to synthesize ground‑truth “what should have happened” facts, eliminating memorization shortcuts.
  • Fine‑grained diagnostic metrics – three orthogonal measures derived from dynamic program slicing (a toy computation of all three is sketched after this list):
    1. ESV (Execution‑State Volume) – how much of the codebase the agent must read to reconstruct the relevant state.
    2. MCL (Mutation‑Cause Length) – depth of the logical chain the agent must simulate to locate the bug’s root cause.
    3. DFI (Dependency‑Fusion Index) – breadth of cross‑file integration the agent must handle simultaneously.
  • Comprehensive evaluation of state‑of‑the‑art agents (Claude‑4.5‑Sonnet, DeepSeek‑v3.1‑Terminus, etc.), revealing a systematic “aggregation deficit” where DFI is the dominant bottleneck.
  • Open‑source tooling for reproducing the benchmark and extending it to new repositories, encouraging community‑driven progress.
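
To make the three diagnostic metrics concrete, the toy sketch below treats a dynamic slice as a set of executed statements plus causal edges between them, and derives a volume, chain-depth, and file-spread number from it. The `SliceStatement` structure and the exact formulas are illustrative assumptions, not the paper's formal definitions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceStatement:
    """One executed statement in a dynamic slice (illustrative structure)."""
    file: str   # source file the statement lives in
    line: int   # line number within that file


def esv(slice_stmts):
    """Execution-State Volume: approximated here as the number of distinct
    statements the agent must read to reconstruct the relevant state."""
    return len(set(slice_stmts))


def mcl(causal_edges, failure):
    """Mutation-Cause Length: depth of the longest causal chain from the
    failing assertion back to the injected mutation (longest-path search)."""
    parents = defaultdict(list)
    for src, dst in causal_edges:   # src causally influences dst
        parents[dst].append(src)

    def depth(node, seen):
        if node in seen:            # guard against cycles in the recorded trace
            return 0
        seen = seen | {node}
        return 1 + max((depth(p, seen) for p in parents[node]), default=0)

    return depth(failure, frozenset())


def dfi(slice_stmts):
    """Dependency-Fusion Index: approximated here as the number of distinct
    files the slice spans, i.e. how much cross-file context must be fused."""
    return len({s.file for s in slice_stmts})


# Toy slice: a failing test in tests/test_api.py caused by a mutation in core/utils.py.
stmts = [
    SliceStatement("core/utils.py", 42),
    SliceStatement("core/pipeline.py", 17),
    SliceStatement("tests/test_api.py", 8),
]
edges = [(stmts[0], stmts[1]), (stmts[1], stmts[2])]
print(esv(stmts), mcl(edges, stmts[2]), dfi(stmts))  # -> 3 3 3
```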

Methodology

  1. Dataset Construction – The authors start from a curated set of open‑source repositories (e.g., popular Python and JavaScript projects).
  2. Mutation & Execution – For each repo, they apply targeted source‑code mutations (e.g., off‑by‑one errors, missing imports). The mutated program is executed; the runtime environment (variable values, stack traces) serves as a semantic oracle that records the “correct” state before the bug manifests (a toy version of this step is sketched just after the list).
  3. Assertion Generation – Using the oracle, they automatically generate logical assertions that should hold (e.g., “after process_data, len(result) == expected”). The mutated code deliberately violates these assertions.
  4. Dynamic Program Slicing – When an assertion fails, the framework slices the execution trace to identify the minimal set of statements influencing the failure. This slice feeds the three metrics (ESV, MCL, DFI).
  5. Agent Evaluation – LLM agents are prompted to explain the failure and propose a fix. Their responses are scored against the ground‑truth slice, yielding a white‑box view of where the agent succeeded or fell short.
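
As a rough illustration of steps 2–3, the sketch below applies a classic off-by-one mutation with Python's `ast` module and turns observed runtime values into a checkable assertion. For simplicity, the oracle here is recorded from an unmutated reference run; the mutation operator, target function, and assertion template are assumptions rather than the paper's actual framework.

```python
import ast

ORIGINAL = """
def count_up_to(n):
    result = []
    i = 0
    while i <= n:          # inclusive upper bound
        result.append(i)
        i += 1
    return result
"""


class OffByOne(ast.NodeTransformer):
    """Illustrative mutation operator: rewrite '<=' as '<' (classic off-by-one)."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Lt() if isinstance(op, ast.LtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return its count_up_to function."""
    namespace = {}
    exec(compile(tree, "<repo>", "exec"), namespace)
    return namespace["count_up_to"]


original_tree = ast.parse(ORIGINAL)
mutant_tree = ast.fix_missing_locations(OffByOne().visit(ast.parse(ORIGINAL)))

# Step 2: execute the code and record runtime values as the semantic oracle
# (here, for simplicity, the oracle comes from an unmutated reference run).
oracle_value = load(original_tree)(5)    # [0, 1, 2, 3, 4, 5]

# Step 3: synthesize an assertion from the oracle and check it against the mutant.
mutant_value = load(mutant_tree)(5)      # [0, 1, 2, 3, 4]
assertion = f"len(count_up_to(5)) == {len(oracle_value)}"
print(assertion, "holds on mutant:", len(mutant_value) == len(oracle_value))  # -> False
```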

The whole pipeline is fully automated, allowing large‑scale benchmarking without manual labeling.
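
For step 5, a white-box comparison could look like the following sketch, which scores the statements an agent cites in its explanation against the ground-truth slice. The statement-id format and the precision/recall scoring are assumptions for illustration; the paper's scoring protocol may differ.

```python
def score_agent_answer(agent_cited, ground_truth_slice):
    """White-box scoring sketch: compare the statements an agent cites in its
    explanation against the ground-truth slice and report the overlap."""
    cited, truth = set(agent_cited), set(ground_truth_slice)
    hit = cited & truth
    return {
        "precision": len(hit) / len(cited) if cited else 0.0,
        "recall": len(hit) / len(truth) if truth else 0.0,
        "missed": sorted(truth - cited),     # slice statements the agent never mentioned
    }


truth = {"core/utils.py:42", "core/pipeline.py:17", "tests/test_api.py:8"}
agent = {"core/pipeline.py:17", "tests/test_api.py:8", "core/io.py:3"}  # one miss, one spurious cite
print(score_agent_answer(agent, truth))
# {'precision': 0.666..., 'recall': 0.666..., 'missed': ['core/utils.py:42']}
```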

Results & Findings

  • Performance Gap – Even the strongest agents achieve only ~38% on the composite RepoReason score, far below the human baseline (~85%).
  • Aggregation Deficit – DFI (integration width) is the strongest predictor of failure; agents struggle when the bug’s cause spans many files or modules.
  • Reading Load vs. Simulation Depth – Agents handle high ESV reasonably well (they can “read” many lines), but deep logical chaining (high MCL) remains a secondary challenge.
  • Model Size vs. Reasoning Ability – Scaling up model parameters yields diminishing returns on DFI; architectural changes (e.g., explicit memory or graph‑based reasoning) appear more promising.
  • Cross‑language Consistency – Performance patterns hold across Python, JavaScript, and Go repos, suggesting the bottleneck is architectural rather than language‑specific.

Practical Implications

  • Tooling for DevOps – RepoReason can be integrated into CI pipelines to stress‑test AI‑powered code reviewers or automated refactoring bots before they are deployed on production codebases (a minimal gate script is sketched after this list).
  • Guidance for Model Designers – The three metrics act as a diagnostic “report card,” helping engineers pinpoint whether to invest in better retrieval mechanisms (to lower DFI) or deeper reasoning modules (to lower MCL).
  • Improved Bug‑Localization Assistants – By exposing the aggregation deficit, the paper motivates hybrid systems that combine LLM reasoning with graph‑based dependency analysis, potentially yielding more reliable automated debugging assistants.
  • Benchmark‑Driven Procurement – Companies evaluating LLM‑based coding assistants can use RepoReason as a realistic, repository‑level test instead of synthetic snippet challenges, leading to more trustworthy procurement decisions.
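
As one sketch of the CI idea above, the gate script below fails a build when a report of RepoReason-style scores falls below agreed thresholds. The JSON report schema, metric field names, and threshold values are hypothetical, since the summary does not describe the tooling's actual output format.

```python
import json
import sys

# Hypothetical report schema: the summary does not describe the real tooling's
# output format, so these field names and thresholds are assumptions.
THRESHOLDS = {"composite": 0.35, "esv": 0.50, "mcl": 0.30, "dfi": 0.25}


def gate(report_path: str) -> int:
    """Fail a CI job when an AI reviewer's RepoReason-style scores fall below
    the agreed minimums (illustrative quality gate, not the official tooling)."""
    with open(report_path) as fh:
        scores = json.load(fh)   # e.g. {"composite": 0.38, "esv": 0.61, ...}
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in THRESHOLDS.items()
        if scores.get(metric, 0.0) < minimum
    ]
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```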

Limitations & Future Work

  • Scope of Mutations – The current mutation set focuses on classic logical bugs; more exotic defects (e.g., performance regressions, security vulnerabilities) remain untested.
  • Static vs. Dynamic Languages – While Python and JavaScript are well covered, languages with heavy compile‑time semantics (e.g., Rust, C++) may require additional slicing techniques.
  • Human Baseline – The paper reports a single human baseline; broader user studies would better quantify the gap between developers and LLM agents.
  • Scalability of Slicing – Dynamic slicing on very large repositories can be computationally expensive; future work could explore approximate slicing or static‑analysis proxies.
  • Agent Interaction Model – Evaluations assume a single turn of reasoning; multi‑turn, interactive debugging sessions could reveal different strengths and weaknesses.

RepoReason opens a concrete path toward evaluating—and ultimately improving—LLM agents that need to think like real software engineers, handling the tangled, multi‑file reality of production code.

Authors

  • Jia Li
  • Yuxin Su
  • Michael R. Lyu

Paper Information

  • arXiv ID: 2601.03731v1
  • Categories: cs.SE, cs.AI
  • Published: January 7, 2026