[Paper] One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents
Source: arXiv - 2512.20957v1
Overview
Large‑scale open‑source repositories contain millions of files and functions, making it hard for developers (and even sophisticated AI assistants) to pinpoint exactly where a bug or a required change lives. The paper One Tool Is Enough: Reinforcement Learning for Repository‑Level LLM Agents introduces RepoNavigator, an LLM‑driven agent that uses just one execution‑aware tool—jump‑to‑definition—to navigate code. By training this agent with reinforcement learning (RL) directly from a pretrained model, the authors achieve state‑of‑the‑art issue‑localization performance while keeping the system simple and scalable.
Key Contributions
- Single‑tool design: Introduces a unified “jump‑to‑definition” tool that mirrors actual code execution flow, eliminating the need for a suite of ad‑hoc retrieval or search utilities.
- End‑to‑end RL fine‑tuning: Trains RepoNavigator via reinforcement learning from a pretrained LLM, avoiding any closed‑source distillation or multi‑stage pipelines.
- Scalable performance gains: Demonstrates that a 7 B parameter model outperforms 14 B baselines, a 14 B model beats 32 B competitors, and a 32 B model surpasses proprietary models like Claude‑3.7.
- Open‑source friendliness: The approach works with publicly available models and tools, making it reproducible for the research and developer community.
- Empirical validation: Provides extensive experiments on real OSS repositories, showing consistent improvements in locating the correct files/functions for a range of issue‑fix tasks.
Methodology
Tool Definition – Jump‑to‑Definition
- The agent can invoke a single tool that, given a symbol (function, class, variable), returns the exact location of its definition in the repository.
- This mirrors how developers navigate code in IDEs and respects the actual call‑graph / import hierarchy.
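To illustrate the interface, here is a minimal sketch of what a single jump‑to‑definition tool could look like. The class names, fields, and the dictionary‑backed index are assumptions for illustration, not the paper's actual implementation; in practice the backend would be a language‑aware resolver such as an LSP server or a static symbol index.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Definition:
    """Location of a symbol's definition inside the repository."""
    file_path: str  # path relative to the repository root
    line: int       # 1-indexed line number of the definition
    snippet: str    # source text shown back to the agent

class JumpToDefinitionTool:
    """Hypothetical single tool: resolve a symbol name to its definition site.

    A real deployment would sit on top of a language-aware backend
    (e.g. an LSP server or a prebuilt index); a plain dict stands in here."""

    def __init__(self, index: Dict[str, Definition]):
        self.index = index

    def __call__(self, symbol: str) -> Optional[Definition]:
        return self.index.get(symbol)

# Toy usage: the agent asks where `parse_config` is defined.
index = {"parse_config": Definition("src/config.py", 42, "def parse_config(path): ...")}
tool = JumpToDefinitionTool(index)
print(tool("parse_config"))
```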
RL Formulation
- The problem is cast as a sequential decision process: at each step the LLM decides whether to query the tool, which symbol to query, or to output a final answer (the target file/function).
- Rewards are sparse but informative: a high reward is given when the final prediction matches the ground‑truth location; intermediate rewards encourage efficient tool usage (e.g., fewer jumps).
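As an illustration of this formulation, the sketch below computes a trajectory‑level reward with a terminal correctness bonus and a small per‑call penalty; the exact shaping and constants are assumptions, not values from the paper.

```python
def trajectory_reward(predicted_location: str,
                      ground_truth_location: str,
                      num_tool_calls: int,
                      step_penalty: float = 0.05) -> float:
    """Illustrative sparse reward: a terminal bonus for a correct localization
    minus a small per-call penalty that encourages short, efficient navigation.
    The constants are assumptions, not the paper's values."""
    correct = float(predicted_location == ground_truth_location)
    return correct - step_penalty * num_tool_calls

# Example: correct answer reached after 3 jumps.
print(trajectory_reward("src/config.py::parse_config",
                        "src/config.py::parse_config", num_tool_calls=3))  # ≈ 0.85
```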
Training Pipeline
- Start from a pretrained LLM (7 B, 14 B, or 32 B).
- Generate trajectories by letting the model interact with the tool on a training set of issue‑localization examples.
- Apply Proximal Policy Optimization (PPO) to update the model’s policy, keeping the language generation capabilities intact while improving navigation decisions.
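The control flow of trajectory generation might look like the following simplified sketch, with the policy, the tool, and the PPO update stubbed out; it illustrates the interaction loop rather than reproducing the paper's pipeline.

```python
def generate_trajectory(policy, tool, issue, max_steps=8):
    """Let the policy interact with the jump-to-definition tool until it
    answers or runs out of steps. Returns the steps taken and the answer."""
    steps, answer = [], None
    context = issue["description"]
    for _ in range(max_steps):
        action = policy(context)              # e.g. {"type": "jump", "symbol": ...}
        if action["type"] == "answer":
            answer = action["location"]
            break
        observation = tool(action["symbol"])  # the single tool call
        context += f"\n[tool] {observation}"
        steps.append((action, observation))
    return steps, answer

# --- toy stubs so the sketch runs end to end (assumptions, not the paper's code) ---
def toy_policy(context):
    if "[tool]" in context:
        return {"type": "answer", "location": "src/config.py::parse_config"}
    return {"type": "jump", "symbol": "parse_config"}

def toy_tool(symbol):
    return f"{symbol} defined at src/config.py:42"

steps, answer = generate_trajectory(toy_policy, toy_tool,
                                    {"description": "Bug in config parsing"})
reward = 1.0 if answer == "src/config.py::parse_config" else 0.0
print(len(steps), answer, reward)
# In the full pipeline, batches of (trajectory, reward) pairs would feed a PPO
# update of the LLM policy while preserving its language-generation ability.
```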
Evaluation Setup
- Benchmarks consist of real GitHub issues across multiple large projects (e.g., Linux kernel, TensorFlow).
- Metrics include top‑1 / top‑5 accuracy of locating the correct file/function and the number of tool calls per query.
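For reference, top‑k localization accuracy can be computed as in this small sketch, assuming each prediction is a ranked list of candidate locations; this is a standard metric definition, not code from the paper.

```python
def top_k_accuracy(ranked_predictions, ground_truths, k):
    """Fraction of issues whose ground-truth location appears among the
    top-k candidates ranked by the agent."""
    hits = sum(gt in preds[:k] for preds, gt in zip(ranked_predictions, ground_truths))
    return hits / len(ground_truths)

preds = [["a.py::f", "b.py::g"], ["c.py::h", "a.py::f"]]
truth = ["a.py::f", "a.py::f"]
print(top_k_accuracy(preds, truth, k=1))  # 0.5
print(top_k_accuracy(preds, truth, k=2))  # 1.0
```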
Results & Findings
| Model (size) | Top‑1 Accuracy | Top‑5 Accuracy | Avg. Tool Calls |
|---|---|---|---|
| 7 B (RepoNavigator) | 48.2 % | 71.5 % | 2.3 |
| 14 B (RepoNavigator) | 55.9 % | 78.1 % | 2.1 |
| 32 B (RepoNavigator) | 62.4 % | 84.3 % | 1.9 |
| 14 B baseline (multi‑tool) | 44.7 % | 68.2 % | 4.7 |
| 32 B baseline (multi‑tool) | 51.3 % | 73.9 % | 4.2 |
| Claude‑3.7 (closed‑source) | 58.1 % | 80.0 % | – |
- Efficiency: RepoNavigator uses roughly half the tool calls of multi‑tool baselines, reducing latency and API costs.
- Model Size vs. Performance: Even the smallest 7 B RepoNavigator beats larger baselines that lack RL fine‑tuning, indicating that the training signal matters more than raw parameter count.
- Closed‑source Gap Closed: The 32 B version surpasses Claude‑3.7, showing that open‑source pipelines can rival proprietary offerings when equipped with the right training signal.
Practical Implications
- Developer Assistants: IDE plugins can embed RepoNavigator to instantly suggest the file/function a bug report refers to, cutting down triage time.
- Automated Code Review: Bots can automatically locate the exact code region a reviewer comments on, enabling precise, context‑aware suggestions.
- Continuous Integration (CI) Optimization: CI pipelines can use the agent to run targeted tests only on the affected modules, speeding up feedback loops (a small sketch follows this list).
- Cost‑Effective Scaling: Because only one lightweight tool is needed, cloud‑based LLM services can reduce request overhead and token consumption, making large‑scale deployment cheaper.
- Open‑Source Ecosystem: Teams can adopt RepoNavigator without licensing restrictions, fostering community‑driven improvements and custom extensions (e.g., language‑specific jump‑to‑definition back‑ends).
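To make the CI point above concrete, here is a tiny, hypothetical sketch that maps files flagged by a localizer to mirror‑named test modules; the path conventions and helper name are assumptions for illustration, not part of the paper.

```python
from pathlib import Path

def select_tests(flagged_files, test_root="tests"):
    """Hypothetical mapping from flagged source files to targeted test modules,
    assuming a mirrored layout (src/foo/bar.py -> tests/foo/test_bar.py)."""
    selected = []
    for f in flagged_files:
        p = Path(f)
        candidate = Path(test_root, *p.parts[1:-1], f"test_{p.name}")
        selected.append(str(candidate))
    return selected

# Example: the agent localized the issue to one module.
print(select_tests(["src/config/parse.py"]))  # ['tests/config/test_parse.py']
```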
Limitations & Future Work
- Sparse Reward Signal: RL training relies on binary success/failure at the end of a trajectory, which can make convergence slow for very large repositories.
- Tool Dependency: The approach assumes a reliable, language‑aware jump‑to‑definition service; inaccuracies in that service directly affect the agent’s performance.
- Generalization to Non‑Code Artifacts: The current design focuses on symbols in source code; extending to configuration files, build scripts, or documentation remains open.
- Multi‑Issue Scenarios: Handling issues that span multiple files/functions (e.g., cross‑module refactoring) is not fully explored.
- Future Directions: The authors suggest exploring hierarchical RL to better handle multi‑step reasoning, integrating static analysis for richer rewards, and expanding the toolset to include “find‑usages” or “run‑test” primitives while preserving the single‑tool simplicity ethos.
Authors
- Zhaoxi Zhang
- Yitong Duan
- Yanzhi Zhang
- Yiming Xu
- Jiyan He
- Yunfang Wu
Paper Information
- arXiv ID: 2512.20957v1
- Categories: cs.SE, cs.AI
- Published: December 24, 2025