[Paper] Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents
Source: arXiv:2602.22764v1
Overview
The paper introduces Rust‑SWE‑bench, the first large‑scale benchmark of 500 real‑world, repository‑level software‑engineering tasks drawn from 34 popular Rust projects. Using this benchmark, the authors evaluate several LLM‑powered coding agents and present RUSTFORGER, a new agentic system that dramatically improves automated issue resolution for Rust codebases.
Key Contributions
- Rust‑SWE‑bench: a curated, diverse benchmark of 500 repository‑level Rust issues, complete with code, test suites, and metadata.
- Comprehensive evaluation of four ReAct‑style agents and four state‑of‑the‑art LLMs (Claude, GPT‑4, Gemini, Llama 2) on the benchmark, establishing baseline performance numbers for the Rust ecosystem.
- Diagnosis of failure modes: identification of two dominant challenges—understanding whole‑repo code structure and handling Rust’s strict type/trait system.
- RUSTFORGER: a novel agent framework that (1) automatically provisions a reproducible test environment, and (2) employs Rust metaprogramming‑driven dynamic tracing to surface runtime information during issue reproduction.
- Empirical gains: RUSTFORGER + Claude‑Sonnet‑3.7 resolves 28.6 % of benchmark tasks (a 34.9 % lift over the strongest baseline) and uniquely solves 46 tasks that no other LLM‑agent combination could handle.
Methodology
- Benchmark construction – The authors mined GitHub for closed Rust issues that required code changes and had an associated test or reproducible failure. Each entry includes the full repository snapshot, the original issue description, and a minimal test harness.
- Agent baseline – Four ReAct‑style agents (each wrapping a different LLM) were given the issue text, repository files, and a sandboxed Rust toolchain. Agents could query the code, run tests, and edit files iteratively.
- Failure analysis – After running the baseline agents, the team manually inspected unsuccessful runs to pinpoint systematic gaps (e.g., missing crate‑level context, trait‑resolution errors).
- RUSTFORGER design –
  - Automated environment setup: scripts automatically fetch dependencies, compile the crate, and run the failing test to confirm the problem.
  - Dynamic tracing: using Rust's procedural‑macro and tracing facilities, the agent injects instrumentation that logs type information, trait implementations, and runtime values at the point of failure.
  - Iterative reasoning loop: the LLM consumes the trace logs, updates its mental model, and proposes a patch, which is then re‑tested.
- Evaluation – All agents (baseline + RUSTFORGER) were run on the full benchmark under identical hardware and time budgets. Success is defined as a patch that (a) compiles, (b) passes the original failing test, and (c) does not break any existing tests.
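The three-part success criterion can be sketched as a small Rust driver. This is a minimal illustration of the evaluation logic described above, not the authors' actual harness; the names `patch_succeeds`, `repo_dir`, and `failing_test` are assumptions for illustration.

```rust
use std::process::Command;

/// Sketch of the paper's success criterion: a patch succeeds iff
/// (a) the crate compiles, (b) the originally failing test now passes,
/// and (c) the full test suite still passes.
fn patch_succeeds(repo_dir: &str, failing_test: &str) -> bool {
    let run = |args: &[&str]| -> bool {
        Command::new("cargo")
            .args(args)
            .current_dir(repo_dir)
            .status()
            .map(|status| status.success())
            .unwrap_or(false) // a failed spawn counts as failure
    };
    run(&["build"])                     // (a) compiles
        && run(&["test", failing_test]) // (b) target test passes
        && run(&["test"])               // (c) no existing tests break
}

fn main() {
    // With a nonexistent repository directory, every step fails.
    println!("{}", patch_succeeds("/nonexistent-repo", "issue_repro"));
}
```

The short-circuiting `&&` chain mirrors the paper's ordering: there is no point running the target test on a crate that does not compile, or the full suite when the target test still fails.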
Results & Findings
| Agent (LLM) | Success Rate | Notable Strengths |
|---|---|---|
| ReAct + Claude‑Sonnet‑3.7 | 21.2 % | Good at high‑level reasoning, but often stalls on crate‑wide imports. |
| ReAct + GPT‑4 | 18.7 % | Strong natural‑language understanding, weaker on Rust‑specific semantics. |
| ReAct + Gemini‑1.5 | 16.9 % | Handles simple bug patterns, struggles with lifetimes/traits. |
| ReAct + Llama‑2‑70B | 14.3 % | Decent at code generation, frequent type‑mismatch errors. |
| RUSTFORGER + Claude‑Sonnet‑3.7 | 28.6 % | 34.9 % improvement over best baseline; uniquely solves 46 tasks. |
Key takeaways
- Issue reproduction matters: agents that could reliably reproduce the failing test before attempting a fix performed far better.
- Repository‑wide context is a bottleneck: many failures stemmed from missing knowledge about other modules, feature flags, or macro expansions.
- Dynamic tracing bridges the gap: exposing concrete type and trait information at runtime gave the LLM the missing pieces to reason about Rust’s strict compile‑time guarantees.
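The tracing takeaway can be illustrated with a small sketch of my own (not the paper's instrumentation, which also logs trait implementations): a helper plus a wrapper macro that log an expression's runtime value and concrete, monomorphized type, then yield the value unchanged, so instrumentation never alters program behavior.

```rust
use std::any::type_name;
use std::fmt::Debug;

/// Log an expression's value and concrete type, then return it unchanged.
fn traced<T: Debug>(label: &str, value: T) -> T {
    eprintln!("[trace] {label} = {value:?} : {}", type_name::<T>());
    value
}

/// Macro sugar so call sites can wrap any expression in place,
/// capturing its source text via `stringify!`.
macro_rules! trace_val {
    ($e:expr) => {
        traced(stringify!($e), $e)
    };
}

fn main() {
    // Wrapping the parse surfaces the exact Result type and error variant
    // that a static read of the source would leave implicit.
    let port = trace_val!("8080".parse::<u16>());
    assert_eq!(port, Ok(8080));
    let bad = trace_val!("oops".parse::<u16>());
    assert!(bad.is_err());
}
```

Logs like `[trace] "8080".parse::<u16>() = Ok(8080) : core::result::Result<u16, core::num::error::ParseIntError>` hand the LLM exactly the compile-time facts (concrete types, error variants) that the failure analysis found baseline agents missing.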
Practical Implications
- Developer tooling – Integrating RUSTFORGER‑style pipelines into IDE extensions (e.g., VS Code, IntelliJ) could offer “one‑click” automated suggestions for open Rust issues, reducing triage time.
- Continuous Integration (CI) – CI jobs could automatically invoke a RUSTFORGER agent on newly opened issues, generating draft patches that reviewers can validate, accelerating bug‑fix cycles.
- Onboarding & education – New Rust contributors often stumble on lifetimes and trait bounds; an agent that reproduces and explains failures with concrete traces can serve as an interactive tutor.
- Enterprise Rust codebases – Large monorepos (e.g., in systems programming or blockchain) can benefit from automated issue resolution at scale, freeing senior engineers to focus on architectural work.
- Benchmark‑driven LLM development – Rust‑SWE‑bench provides a concrete target for future LLM fine‑tuning, encouraging the community to build models that understand low‑level systems languages.
Limitations & Future Work
- Benchmark bias – The dataset leans toward issues that already have test cases; many real‑world bugs lack reproducible tests, so performance may differ in the wild.
- LLM dependence – RUSTFORGER’s gains are tied to the underlying LLM (Claude‑Sonnet‑3.7); results could vary with other models or future versions.
- Scalability of tracing – Injecting dynamic instrumentation adds compile‑time overhead; for very large crates the setup may become costly.
- Generalization beyond Rust – While the tracing technique exploits Rust’s macro system, adapting it to other compiled languages (e.g., C++, Go) will require language‑specific instrumentation.
Future research directions include expanding the benchmark to cover non‑test‑driven bugs, optimizing the tracing pipeline for speed, and exploring multi‑agent collaboration where one LLM handles environment setup while another focuses on patch synthesis.
Authors
- Jiahong Xiang
- Wenxiao He
- Xihua Wang
- Hongliang Tian
- Yuqun Zhang
Paper Information
- arXiv ID: 2602.22764v1
- Categories: cs.SE
- Published: February 26, 2026