[Paper] Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents
Source: arXiv:2602.22764v1
Overview
The paper introduces Rust‑SWE‑bench, the first large‑scale benchmark of 500 real‑world, repository‑level software‑engineering tasks drawn from 34 popular Rust projects. Using this benchmark, the authors evaluate several LLM‑powered coding agents and present RUSTFORGER, a new agentic system that dramatically improves automated issue resolution for Rust codebases.
Key Contributions
- Rust‑SWE‑bench: a curated, diverse benchmark of 500 repository‑level Rust issues, complete with code, test suites, and metadata.
- Comprehensive evaluation of four ReAct‑style agents and four state‑of‑the‑art LLMs (Claude, GPT‑4, Gemini, Llama 2) on the benchmark, establishing baseline performance numbers for the Rust ecosystem.
- Diagnosis of failure modes: identification of two dominant challenges—understanding whole‑repo code structure and handling Rust’s strict type/trait system.
- RUSTFORGER: a novel agent framework that (1) automatically provisions a reproducible test environment, and (2) employs Rust metaprogramming‑driven dynamic tracing to surface runtime information during issue reproduction.
- Empirical gains: RUSTFORGER + Claude‑Sonnet‑3.7 resolves 28.6 % of benchmark tasks (a 34.9 % lift over the strongest baseline) and uniquely solves 46 tasks that no other LLM‑agent combination could handle.
Methodology
- Benchmark construction – The authors mined GitHub for closed Rust issues that required code changes and had an associated test or reproducible failure. Each entry includes the full repository snapshot, the original issue description, and a minimal test harness.
- Agent baseline – Four ReAct‑style agents (each wrapping a different LLM) were given the issue text, repository files, and a sandboxed Rust toolchain. Agents could query the code, run tests, and edit files iteratively.
- Failure analysis – After running the baseline agents, the team manually inspected unsuccessful runs to pinpoint systematic gaps (e.g., missing crate‑level context, trait‑resolution errors).
- RUSTFORGER design –
  - Automated environment setup: scripts automatically fetch dependencies, compile the crate, and run the failing test to confirm the problem.
  - Dynamic tracing: using Rust's procedural‑macro and tracing facilities, the agent injects instrumentation that logs type information, trait implementations, and runtime values at the point of failure.
  - Iterative reasoning loop: the LLM consumes the trace logs, updates its mental model, and proposes a patch, which is then re‑tested.
- Evaluation – All agents (baseline + RUSTFORGER) were run on the full benchmark under identical hardware and time budgets. Success is defined as a patch that (a) compiles, (b) passes the original failing test, and (c) does not break any existing tests.
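The three-part success criterion can be sketched as a small Rust driver. This is a minimal illustration of the evaluation logic described above, not the authors' actual harness; the names `patch_succeeds`, `repo_dir`, and `failing_test` are assumptions for illustration.

```rust
use std::process::Command;

/// Sketch of the paper's success criterion: a patch succeeds iff
/// (a) the crate compiles, (b) the originally failing test now passes,
/// and (c) the full test suite still passes.
fn patch_succeeds(repo_dir: &str, failing_test: &str) -> bool {
    let run = |args: &[&str]| -> bool {
        Command::new("cargo")
            .args(args)
            .current_dir(repo_dir)
            .status()
            .map(|status| status.success())
            .unwrap_or(false) // a failed spawn counts as failure
    };
    run(&["build"])                     // (a) compiles
        && run(&["test", failing_test]) // (b) target test passes
        && run(&["test"])               // (c) no existing tests break
}

fn main() {
    // With a nonexistent repository directory, every step fails.
    println!("{}", patch_succeeds("/nonexistent-repo", "issue_repro"));
}
```

The short-circuiting `&&` chain mirrors the paper's ordering: there is no point running the target test on a crate that does not compile, or the full suite when the target test still fails.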
Results & Findings
| Agent (LLM) | Success Rate | Notable Strengths |
|---|---|---|
| ReAct + Claude‑Sonnet‑3.7 | 21.2 % | Good at high‑level reasoning, but often stalls on crate‑wide imports. |
| ReAct + GPT‑4 | 18.7 % | Strong natural‑language understanding, weaker on Rust‑specific semantics. |
| ReAct + Gemini‑1.5 | 16.9 % | Handles simple bug patterns, struggles with lifetimes/traits. |
| ReAct + Llama‑2‑70B | 14.3 % | Decent at code generation, frequent type‑mismatch errors. |
| RUSTFORGER + Claude‑Sonnet‑3.7 | 28.6 % | 34.9 % improvement over best baseline; uniquely solves 46 tasks. |
Key takeaways
- Issue reproduction matters: agents that could reliably reproduce the failing test before attempting a fix performed far better.
- Repository‑wide context is a bottleneck: many failures stemmed from missing knowledge about other modules, feature flags, or macro expansions.
- Dynamic tracing bridges the gap: exposing concrete type and trait information at runtime gave the LLM the missing pieces to reason about Rust’s strict compile‑time guarantees.
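The tracing takeaway can be illustrated with a small sketch of my own (not the paper's instrumentation, which also logs trait implementations): a helper plus a wrapper macro that log an expression's runtime value and concrete, monomorphized type, then yield the value unchanged, so instrumentation never alters program behavior.

```rust
use std::any::type_name;
use std::fmt::Debug;

/// Log an expression's value and concrete type, then return it unchanged.
fn traced<T: Debug>(label: &str, value: T) -> T {
    eprintln!("[trace] {label} = {value:?} : {}", type_name::<T>());
    value
}

/// Macro sugar so call sites can wrap any expression in place,
/// capturing its source text via `stringify!`.
macro_rules! trace_val {
    ($e:expr) => {
        traced(stringify!($e), $e)
    };
}

fn main() {
    // Wrapping the parse surfaces the exact Result type and error variant
    // that a static read of the source would leave implicit.
    let port = trace_val!("8080".parse::<u16>());
    assert_eq!(port, Ok(8080));
    let bad = trace_val!("oops".parse::<u16>());
    assert!(bad.is_err());
}
```

Logs like `[trace] "8080".parse::<u16>() = Ok(8080) : core::result::Result<u16, core::num::error::ParseIntError>` hand the LLM exactly the compile-time facts (concrete types, error variants) that the failure analysis found baseline agents missing.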
Practical Implications
- Developer tooling – Integrating RUSTFORGER‑style pipelines into IDE extensions (e.g., VS Code, IntelliJ) could offer “one‑click” automated suggestions for open Rust issues, reducing triage time.
- Continuous Integration (CI) – CI jobs could automatically invoke a RUSTFORGER agent on newly opened issues, generating draft patches that reviewers can validate, accelerating bug‑fix cycles.
- Onboarding & education – New Rust contributors often stumble on lifetimes and trait bounds; an agent that reproduces and explains failures with concrete traces can serve as an interactive tutor.
- Enterprise Rust codebases – Large monorepos (e.g., in systems programming or blockchain) can benefit from automated issue resolution at scale, freeing senior engineers to focus on architectural work.
- Benchmark‑driven LLM development – Rust‑SWE‑bench provides a concrete target for future LLM fine‑tuning, encouraging the community to build models that understand low‑level systems languages.
Limitations & Future Work
- Benchmark bias – The dataset leans toward issues that already have test cases; many real‑world bugs lack reproducible tests, so performance may differ in the wild.
- LLM dependence – RUSTFORGER’s gains are tied to the underlying LLM (Claude‑Sonnet‑3.7); results could vary with other models or future versions.
- Scalability of tracing – Injecting dynamic instrumentation adds compile‑time overhead; for very large crates the setup may become costly.
- Generalization beyond Rust – While the tracing technique exploits Rust’s macro system, adapting it to other compiled languages (e.g., C++, Go) will require language‑specific instrumentation.
Future research directions include expanding the benchmark to cover non‑test‑driven bugs, optimizing the tracing pipeline for speed, and exploring multi‑agent collaboration where one LLM handles environment setup while another focuses on patch synthesis.
Authors
- Jiahong Xiang
- Wenxiao He
- Xihua Wang
- Hongliang Tian
- Yuqun Zhang
Paper Information
- arXiv ID: 2602.22764v1
- Categories: cs.SE
- Published: February 26, 2026