[Paper] One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents
Source: arXiv - 2512.20957v1
Overview
Large‑scale open‑source repositories contain millions of files and functions, making it hard for developers (and even sophisticated AI assistants) to pinpoint exactly where a bug or a required change lives. The paper One Tool Is Enough: Reinforcement Learning for Repository‑Level LLM Agents introduces RepoNavigator, an LLM‑driven agent that uses just one execution‑aware tool—jump‑to‑definition—to navigate code. By training this agent with reinforcement learning (RL) directly from a pretrained model, the authors achieve state‑of‑the‑art issue‑localization performance while keeping the system simple and scalable.
Key Contributions
- Single‑tool design: Introduces a unified “jump‑to‑definition” tool that mirrors actual code execution flow, eliminating the need for a suite of ad‑hoc retrieval or search utilities.
- End‑to‑end RL fine‑tuning: Trains RepoNavigator via reinforcement learning from a pretrained LLM, avoiding any closed‑source distillation or multi‑stage pipelines.
- Scalable performance gains: Demonstrates that a 7 B parameter model outperforms 14 B baselines, a 14 B model beats 32 B competitors, and a 32 B model surpasses proprietary models like Claude‑3.7.
- Open‑source friendliness: The approach works with publicly available models and tools, making it reproducible for the research and developer community.
- Empirical validation: Provides extensive experiments on real OSS repositories, showing consistent improvements in locating the correct files/functions for a range of issue‑fix tasks.
Methodology
Tool Definition – Jump‑to‑Definition
- The agent can invoke a single tool that, given a symbol (function, class, variable), returns the exact location of its definition in the repository.
- This mirrors how developers navigate code in IDEs and respects the actual call‑graph / import hierarchy.
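To illustrate the interface, here is a minimal sketch of what a single jump‑to‑definition tool could look like. The class names, fields, and the dictionary‑backed index are assumptions for illustration, not the paper's actual implementation; in practice the backend would be a language‑aware resolver such as an LSP server or a static symbol index.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Definition:
    """Location of a symbol's definition inside the repository."""
    file_path: str  # path relative to the repository root
    line: int       # 1-indexed line number of the definition
    snippet: str    # source text shown back to the agent

class JumpToDefinitionTool:
    """Hypothetical single tool: resolve a symbol name to its definition site.

    A real deployment would sit on top of a language-aware backend
    (e.g. an LSP server or a prebuilt index); a plain dict stands in here."""

    def __init__(self, index: Dict[str, Definition]):
        self.index = index

    def __call__(self, symbol: str) -> Optional[Definition]:
        return self.index.get(symbol)

# Toy usage: the agent asks where `parse_config` is defined.
index = {"parse_config": Definition("src/config.py", 42, "def parse_config(path): ...")}
tool = JumpToDefinitionTool(index)
print(tool("parse_config"))
```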
RL Formulation
- The problem is cast as a sequential decision process: at each step the LLM decides whether to query the tool, which symbol to query, or to output a final answer (the target file/function).
- Rewards are sparse but informative: a high reward is given when the final prediction matches the ground‑truth location; intermediate rewards encourage efficient tool usage (e.g., fewer jumps).
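As an illustration of this formulation, the sketch below computes a trajectory‑level reward with a terminal correctness bonus and a small per‑call penalty; the exact shaping and constants are assumptions, not values from the paper.

```python
def trajectory_reward(predicted_location: str,
                      ground_truth_location: str,
                      num_tool_calls: int,
                      step_penalty: float = 0.05) -> float:
    """Illustrative sparse reward: a terminal bonus for a correct localization
    minus a small per-call penalty that encourages short, efficient navigation.
    The constants are assumptions, not the paper's values."""
    correct = float(predicted_location == ground_truth_location)
    return correct - step_penalty * num_tool_calls

# Example: correct answer reached after 3 jumps.
print(trajectory_reward("src/config.py::parse_config",
                        "src/config.py::parse_config", num_tool_calls=3))  # ≈ 0.85
```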
Training Pipeline
- Start from a pretrained LLM (7 B, 14 B, or 32 B).
- Generate trajectories by letting the model interact with the tool on a training set of issue‑localization examples.
- Apply Proximal Policy Optimization (PPO) to update the model’s policy, keeping the language generation capabilities intact while improving navigation decisions.
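The control flow of trajectory generation might look like the following simplified sketch, with the policy, the tool, and the PPO update stubbed out; it illustrates the interaction loop rather than reproducing the paper's pipeline.

```python
def generate_trajectory(policy, tool, issue, max_steps=8):
    """Let the policy interact with the jump-to-definition tool until it
    answers or runs out of steps. Returns the steps taken and the answer."""
    steps, answer = [], None
    context = issue["description"]
    for _ in range(max_steps):
        action = policy(context)              # e.g. {"type": "jump", "symbol": ...}
        if action["type"] == "answer":
            answer = action["location"]
            break
        observation = tool(action["symbol"])  # the single tool call
        context += f"\n[tool] {observation}"
        steps.append((action, observation))
    return steps, answer

# --- toy stubs so the sketch runs end to end (assumptions, not the paper's code) ---
def toy_policy(context):
    if "[tool]" in context:
        return {"type": "answer", "location": "src/config.py::parse_config"}
    return {"type": "jump", "symbol": "parse_config"}

def toy_tool(symbol):
    return f"{symbol} defined at src/config.py:42"

steps, answer = generate_trajectory(toy_policy, toy_tool,
                                    {"description": "Bug in config parsing"})
reward = 1.0 if answer == "src/config.py::parse_config" else 0.0
print(len(steps), answer, reward)
# In the full pipeline, batches of (trajectory, reward) pairs would feed a PPO
# update of the LLM policy while preserving its language-generation ability.
```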
Evaluation Setup
- Benchmarks consist of real GitHub issues across multiple large projects (e.g., Linux kernel, TensorFlow).
- Metrics include top‑1 / top‑5 accuracy of locating the correct file/function and the number of tool calls per query.
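For reference, top‑k localization accuracy can be computed as in this small sketch, assuming each prediction is a ranked list of candidate locations; this is a standard metric definition, not code from the paper.

```python
def top_k_accuracy(ranked_predictions, ground_truths, k):
    """Fraction of issues whose ground-truth location appears among the
    top-k candidates ranked by the agent."""
    hits = sum(gt in preds[:k] for preds, gt in zip(ranked_predictions, ground_truths))
    return hits / len(ground_truths)

preds = [["a.py::f", "b.py::g"], ["c.py::h", "a.py::f"]]
truth = ["a.py::f", "a.py::f"]
print(top_k_accuracy(preds, truth, k=1))  # 0.5
print(top_k_accuracy(preds, truth, k=2))  # 1.0
```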
Results & Findings
| Model (size) | Top‑1 Accuracy | Top‑5 Accuracy | Avg. Tool Calls |
|---|---|---|---|
| 7 B (RepoNavigator) | 48.2 % | 71.5 % | 2.3 |
| 14 B (RepoNavigator) | 55.9 % | 78.1 % | 2.1 |
| 32 B (RepoNavigator) | 62.4 % | 84.3 % | 1.9 |
| 14 B baseline (multi‑tool) | 44.7 % | 68.2 % | 4.7 |
| 32 B baseline (multi‑tool) | 51.3 % | 73.9 % | 4.2 |
| Claude‑3.7 (closed‑source) | 58.1 % | 80.0 % | – |
- Efficiency: RepoNavigator uses roughly half the tool calls of multi‑tool baselines, reducing latency and API costs.
- Model Size vs. Performance: Even the smallest 7 B RepoNavigator beats larger baselines that lack RL fine‑tuning, indicating that the training signal matters more than raw parameter count.
- Closed‑source Gap Closed: The 32 B version surpasses Claude‑3.7, showing that open‑source pipelines can rival proprietary offerings when equipped with the right training signal.
Practical Implications
- Developer Assistants: IDE plugins can embed RepoNavigator to instantly suggest the file/function a bug report refers to, cutting down triage time.
- Automated Code Review: Bots can automatically locate the exact code region a reviewer comments on, enabling precise, context‑aware suggestions.
- Continuous Integration (CI) Optimization: CI pipelines can use the agent to run targeted tests only on the affected modules, speeding up feedback loops (a small sketch follows this list).
- Cost‑Effective Scaling: Because only one lightweight tool is needed, cloud‑based LLM services can reduce request overhead and token consumption, making large‑scale deployment cheaper.
- Open‑Source Ecosystem: Teams can adopt RepoNavigator without licensing restrictions, fostering community‑driven improvements and custom extensions (e.g., language‑specific jump‑to‑definition back‑ends).
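To make the CI point above concrete, here is a tiny, hypothetical sketch that maps files flagged by a localizer to mirror‑named test modules; the path conventions and helper name are assumptions for illustration, not part of the paper.

```python
from pathlib import Path

def select_tests(flagged_files, test_root="tests"):
    """Hypothetical mapping from flagged source files to targeted test modules,
    assuming a mirrored layout (src/foo/bar.py -> tests/foo/test_bar.py)."""
    selected = []
    for f in flagged_files:
        p = Path(f)
        candidate = Path(test_root, *p.parts[1:-1], f"test_{p.name}")
        selected.append(str(candidate))
    return selected

# Example: the agent localized the issue to one module.
print(select_tests(["src/config/parse.py"]))  # ['tests/config/test_parse.py']
```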
Limitations & Future Work
- Sparse Reward Signal: RL training relies on binary success/failure at the end of a trajectory, which can make convergence slow for very large repositories.
- Tool Dependency: The approach assumes a reliable, language‑aware jump‑to‑definition service; inaccuracies in that service directly affect the agent’s performance.
- Generalization to Non‑Code Artifacts: The current design focuses on symbols in source code; extending to configuration files, build scripts, or documentation remains open.
- Multi‑Issue Scenarios: Handling issues that span multiple files/functions (e.g., cross‑module refactoring) is not fully explored.
- Future Directions: The authors suggest exploring hierarchical RL to better handle multi‑step reasoning, integrating static analysis for richer rewards, and expanding the toolset to include “find‑usages” or “run‑test” primitives while preserving the single‑tool simplicity ethos.
Authors
- Zhaoxi Zhang
- Yitong Duan
- Yanzhi Zhang
- Yiming Xu
- Jiyan He
- Yunfang Wu
Paper Information
- arXiv ID: 2512.20957v1
- Categories: cs.SE, cs.AI
- Published: December 24, 2025