[Paper] Agentic Repository Mining: A Multi-Task Evaluation
Source: arXiv - 2605.04845v1
Overview
The paper explores a different angle on software‑repository mining: instead of feeding a language model a static, hand‑crafted snippet of code or commit metadata, it lets an LLM‑driven “agent” explore the repository itself using ordinary bash commands (e.g., git log, grep, sed). The authors compare this dynamic, self‑sourcing approach against traditional “static‑prompt” LLMs across four classification tasks and nearly 5,000 data points. The key finding is that agents reach competitive accuracy while sidestepping the context‑window limits of large models.
Key Contributions
- Agentic mining framework: Introduces a pipeline where an LLM issues bash commands, parses the output, and iteratively refines its answer—effectively turning the model into a repository‑aware assistant.
- Comprehensive multi‑task benchmark: Evaluates eight configurations (static vs. agentic, different prompting styles, model sizes) on four real‑world classification tasks (commit intent, code‑review sentiment, line‑level defect detection, repository domain).
- Empirical evidence of robustness: Shows agents maintain stable performance even when artifacts grow large, because they retrieve only the needed fragments instead of stuffing the whole file into the prompt.
- Error‑analysis insight: Manual inspection of 100 disagreement cases reveals many “mistakes” stem from ambiguous ground‑truth labels or from the static baselines lacking sufficient context, suggesting traditional accuracy numbers may under‑represent the value of broader context.
- Open‑source tooling: Releases the agentic mining scripts and benchmark data, enabling reproducibility and easy integration into CI pipelines.
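The idea of retrieving only the needed fragments can be illustrated with a small sketch. This is not the paper's tooling: the function name `relevant_hunks` and the keyword-matching rule are assumptions made for illustration. It splits a unified diff into hunks and keeps only those that mention a search term, which is roughly what an agent achieves when it greps instead of reading a whole diff.

```python
import re

# A unified-diff hunk starts with a header such as "@@ -10,2 +10,3 @@".
HUNK_HEADER = re.compile(r"^@@ .* @@")


def relevant_hunks(diff_text: str, keyword: str) -> list[str]:
    """Return only the hunks of a unified diff that contain `keyword`.

    Hypothetical helper: file headers ("diff --git", "---", "+++") that
    precede the first hunk of each file are dropped.
    """
    hunks: list[str] = []
    current: list[str] = []
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            # A new file begins: close the hunk that was being collected.
            if current:
                hunks.append("\n".join(current))
            current = []
        elif HUNK_HEADER.match(line):
            # A new hunk begins: close the previous one, start a new one.
            if current:
                hunks.append("\n".join(current))
            current = [line]
        elif current:
            current.append(line)
    if current:
        hunks.append("\n".join(current))
    return [hunk for hunk in hunks if keyword in hunk]
```

Passing only the returned hunks to a model keeps the prompt size tied to the number of matches rather than to the size of the diff.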
Methodology
- Task selection – Four classification problems were chosen to cover a spectrum of granularity:
  - Commit intent (e.g., bug‑fix vs. feature)
  - Code‑review sentiment (positive/negative)
  - Line‑level defect detection (buggy vs. clean)
  - Repository domain (web, data‑science, systems, etc.)
- Approach configurations – For each task the authors built:
  - Static LLM: Prompted with a pre‑extracted context (e.g., the full diff or the first N lines).
  - Agentic LLM: Starts with a high‑level question, then iteratively runs bash commands (git show, grep -R, wc -l, etc.) to fetch just‑in‑time information, feeding each result back into the model.
- Model backbone – Experiments used OpenAI’s GPT‑3.5‑turbo and GPT‑4, as well as a smaller open‑source LLaMA‑2 variant, to test scaling behavior.
- Evaluation – Accuracy against the curated ground‑truth labels was measured. Additionally, a human audit of 100 discordant predictions was performed to understand the nature of the errors.
- Implementation details – The agent loop caps at five command‑execution steps per query to keep latency reasonable; timeouts and sandboxing ensure safety when running arbitrary bash.
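The agent loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `classify`, `run_command`, the `next_action` callback (a stand-in for the model call), the 10-second timeout, and the command allow-list are all hypothetical. Only the five-step cap comes from the summary.

```python
import shlex
import subprocess
from typing import Callable, Optional

MAX_STEPS = 5         # step cap stated in the summary
TIMEOUT_SECONDS = 10  # assumed value; the summary does not give one
ALLOWED = {"git", "grep", "wc", "sed", "ls", "cat", "head"}  # assumed allow-list

Transcript = list[tuple[str, str]]
Action = tuple[str, str]  # ("command", "<shell command>") or ("answer", "<label>")


def run_command(command: str, repo_dir: str) -> str:
    """Run one allow-listed command without a shell; return truncated output.

    An allow-list is not a sandbox: real use would still need isolation.
    """
    try:
        argv = shlex.split(command)
    except ValueError as exc:
        return f"refused: cannot parse command ({exc})"
    if not argv or argv[0] not in ALLOWED:
        return f"refused: {command!r} is not on the allow-list"
    try:
        done = subprocess.run(argv, cwd=repo_dir, capture_output=True,
                              text=True, timeout=TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        return "error: command timed out"
    except OSError as exc:
        return f"error: {exc}"
    return (done.stdout + done.stderr)[:4000]  # keep the model's context small


def classify(question: str,
             repo_dir: str,
             next_action: Callable[[Transcript], Action],
             runner: Callable[[str, str], str] = run_command) -> Optional[str]:
    """Alternate between model decisions and command output until an answer.

    Returns None if the model has not answered after MAX_STEPS commands.
    """
    transcript: Transcript = [("question", question)]
    for _ in range(MAX_STEPS):
        kind, payload = next_action(transcript)
        if kind == "answer":
            return payload
        transcript.append(("command", payload))
        transcript.append(("output", runner(payload, repo_dir)))
    return None
```

In a real setup `next_action` would wrap a call to the chosen model and parse its reply into a command or a final label; here it is left as a parameter so the loop can be exercised without any API.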
Results & Findings
| Task | Static Prompt (GPT‑4) | Agentic (GPT‑4) | Static Prompt (LLaMA‑2) | Agentic (LLaMA‑2) |
|---|---|---|---|---|
| Commit intent | 84.2 % | 86.1 % | 71.5 % | 73.8 % |
| Review sentiment | 78.9 % | 80.4 % | 65.2 % | 68.0 % |
| Line‑level defect | 81.5 % | 82.9 % | 68.7 % | 70.5 % |
| Repo domain | 90.3 % | 90.7 % | 78.4 % | 79.1 % |
- Competitive accuracy: In every task and for both models shown, the agentic configuration matches or slightly surpasses its static baseline, with gains ranging from 0.4 to 2.8 percentage points.
- Scalability: For large diffs (>10 k lines) static prompts hit the model’s token limit, causing a steep drop in accuracy, whereas agents continue to perform well by pulling only the relevant hunks.
- Robustness to ambiguity: In the manual audit, 62 % of disagreements were traced to vague or overlapping label definitions, not to model failure.
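To make the size of the differences in the table concrete, the per-task gains can be computed directly. The numbers below are transcribed from the table; the helper names `gains` and `mean_gain` are made up for this sketch. On these figures the mean gain works out to about 1.3 percentage points for GPT‑4 and 1.9 for LLaMA‑2; the summary reports no variance, so this says nothing about statistical significance.

```python
# Accuracy (%) transcribed from the results table: (static, agentic) per model.
RESULTS = {
    "Commit intent":     {"GPT-4": (84.2, 86.1), "LLaMA-2": (71.5, 73.8)},
    "Review sentiment":  {"GPT-4": (78.9, 80.4), "LLaMA-2": (65.2, 68.0)},
    "Line-level defect": {"GPT-4": (81.5, 82.9), "LLaMA-2": (68.7, 70.5)},
    "Repo domain":       {"GPT-4": (90.3, 90.7), "LLaMA-2": (78.4, 79.1)},
}


def gains(results: dict) -> dict:
    """Agentic minus static accuracy, in percentage points, per (task, model)."""
    return {
        (task, model): round(agentic - static, 1)
        for task, models in results.items()
        for model, (static, agentic) in models.items()
    }


def mean_gain(results: dict, model: str) -> float:
    """Average gain across tasks for one model, in percentage points."""
    values = [g for (_, m), g in gains(results).items() if m == model]
    return round(sum(values) / len(values), 2)


if __name__ == "__main__":
    for (task, model), gain in gains(RESULTS).items():
        print(f"{task:<18} {model:<8} {gain:+.1f}")
    for model in ("GPT-4", "LLaMA-2"):
        print(f"mean gain {model}: {mean_gain(RESULTS, model):+.2f}")
```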
Practical Implications
- CI/CD integration: Teams can embed an LLM‑agent into their pipelines to automatically label commits, triage pull‑request sentiment, or flag suspicious lines without pre‑building massive prompt contexts.
- Cost efficiency: Because agents request only small snippets, they consume fewer tokens per classification, lowering API bills—especially valuable when working with GPT‑4‑level models.
- Tooling for legacy codebases: Large monorepos often exceed token windows; an agent can still navigate them, making repository‑wide analytics (e.g., tech‑debt heatmaps) feasible.
- Improved data labeling: When building supervised datasets, developers can let the agent explore the full history to generate richer, less noisy labels, reducing the manual effort needed for high‑quality training data.
Limitations & Future Work
- Command sandboxing: The current prototype runs bash commands on the host machine, which raises security concerns for untrusted repositories. Hardened sandbox environments are needed for production use.
- Latency: The iterative command‑execute loop adds overhead (≈2–3 s per classification). Optimizations such as caching frequent queries or parallelizing command execution could mitigate this.
- Ground‑truth quality: The authors note that many benchmark labels were derived from heuristics, leading to ambiguous cases; richer, human‑validated datasets would give a clearer picture of true performance.
- Generalization to non‑Git artifacts: Future research could extend the agentic paradigm to other VCS (Mercurial, Perforce) or to non‑code assets like documentation, issue trackers, and design diagrams.
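The caching idea mentioned under latency could look something like the sketch below. It is an assumption about one possible design, not something the paper describes: `make_cached_runner` is a hypothetical helper that memoizes any command runner on the (command, repository) pair using the standard library's `functools.lru_cache`.

```python
from functools import lru_cache
from typing import Callable


def make_cached_runner(runner: Callable[[str, str], str], maxsize: int = 1024):
    """Wrap a (command, repo_dir) -> output function with an LRU cache.

    Only safe while the repository stays at a fixed revision and for
    read-only commands; otherwise cached output can go stale.
    """
    @lru_cache(maxsize=maxsize)
    def cached(command: str, repo_dir: str) -> str:
        return runner(command, repo_dir)

    return cached
```

When many classifications target the same checkout, repeated queries such as the same git log call are then executed once and served from memory afterwards. The returned function exposes `cache_info()` for checking the hit rate.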
Bottom line: By letting LLMs act as autonomous “shell assistants,” the paper demonstrates a practical path to scale repository mining beyond the constraints of static prompts, opening the door for smarter, more cost‑effective automation in everyday developer workflows.
Authors
- Johannes Härtel
Paper Information
- arXiv ID: 2605.04845v1
- Categories: cs.SE
- Published: May 6, 2026