[Paper] Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval
Source: arXiv - 2605.00447v1
Overview
Linking issue reports to the commits that fix them is a cornerstone of software traceability, yet doing it manually is tedious and error‑prone. This paper revisits a suite of classic and modern issue‑commit linking techniques, evaluates how well they retrieve and rank candidate commits, and asks whether heavyweight large language models (LLMs) actually give a measurable boost over lighter, more traditional methods.
Key Contributions
- Comprehensive benchmark of five established linking pipelines (BTLink, EasyLink, FRLink, RCLinker, Hybrid‑Linker) on a common dataset.
- Systematic comparison of retrieval back‑ends: sparse (BM25, BM25L) vs. dense (SBERT‑Semantic Search, ANNOY, LSH, HNSW).
- Reranking study that pits traditional ML models (logistic regression, gradient‑boosted trees) and a cross‑encoder against several LLMs (ChatGPT, Qwen, Gemma, Llama).
- Evidence that a hybrid of dense and sparse retrieval yields the best recall while keeping candidate sets small.
- Finding that classic ML rerankers outperform LLMs in precision, challenging the “bigger is better” assumption for this task.
Methodology
- Data preparation – The authors collected a large corpus of issue‑commit pairs from open‑source projects, splitting it into training, validation, and test folds.
- Retrieval stage – Each issue is used as a query to pull a shortlist of candidate commits (see the sketch after this list):
  - Sparse methods rely on term-frequency statistics (BM25/BM25L).
  - Dense methods embed issues and commits into a shared vector space (SBERT) and perform approximate nearest-neighbor search with ANNOY, LSH, or HNSW.
- Reranking stage – The shortlist is fed to a second model that scores each candidate:
  - Traditional ML: features such as lexical overlap, temporal distance, and file-path similarity are fed into logistic regression or XGBoost.
  - Cross-encoder: a BERT-style model that jointly encodes the issue and commit text.
  - LLM-based: prompts are sent to ChatGPT, Qwen, Gemma, and Llama; the model returns a relevance score or a binary decision.
- Evaluation metrics – Recall@k (the fraction of true links that appear in the top k), MAP, and precision@1 are reported for each pipeline.
- Efficiency measurement – Wall-clock time and memory consumption are logged to gauge scalability.
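To make the retrieval stage concrete, here is a minimal sketch of the dense path (SBERT embeddings indexed with HNSW), assuming the `sentence-transformers` and `hnswlib` packages; the model name and index parameters are illustrative choices, not the paper's reported settings.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

# Assumption: any SBERT-family encoder works here; the paper's exact model
# is not specified in this summary.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy candidate pool; in practice this is every commit message in the project.
commit_messages = [
    "Fix NPE in config loader when path is empty (closes #123)",
    "Refactor retry logic in the HTTP client",
    "Update docs for the 2.4 release",
]
commit_vecs = model.encode(commit_messages, normalize_embeddings=True)

# Build an HNSW index over the commit embeddings (cosine distance).
index = hnswlib.Index(space="cosine", dim=commit_vecs.shape[1])
index.init_index(max_elements=len(commit_messages), M=16, ef_construction=200)
index.add_items(commit_vecs, list(range(len(commit_messages))))
index.set_ef(100)  # higher ef = better recall at query time, slower queries

def retrieve(issue_text: str, k: int = 100) -> list[int]:
    """Return up to k candidate commit ids for one issue."""
    query_vec = model.encode([issue_text], normalize_embeddings=True)
    labels, _ = index.knn_query(query_vec, k=min(k, index.get_current_count()))
    return labels[0].tolist()

print(retrieve("NullPointerException when the config path is empty", k=2))
```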
Results & Findings
| Retrieval | Recall@100 | Avg. Candidates | Latency (ms/query) |
|---|---|---|---|
| BM25 | 0.62 | 1500 | 12 |
| SBERT‑HNSW | 0.78 | 300 | 8 |
| Hybrid (BM25 + SBERT) | 0.84 | 350 | 10 |
- Dense retrieval (SBERT‑HNSW) consistently outperformed sparse BM25 in recall while dramatically shrinking the candidate pool.
- Hybrid retrieval (the union of the top-k from both) gave the highest recall, suggesting the two approaches capture complementary signals; a sketch of the merge follows below.
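The union itself is only a few lines; a hedged sketch of the merge, using `rank_bm25` as a stand-in BM25 implementation and any dense retriever (such as the HNSW sketch above) for the second list:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_candidates(issue_text, commit_messages, dense_retrieve, k=100):
    """Union of BM25 top-k and dense top-k, dense hits listed first."""
    # Sparse pass: score every commit and keep the k best ids.
    bm25 = BM25Okapi([m.lower().split() for m in commit_messages])
    scores = bm25.get_scores(issue_text.lower().split())
    sparse_ids = np.argsort(-scores)[:k].tolist()
    # Dense pass: any callable returning ranked top-k commit ids.
    dense_ids = dense_retrieve(issue_text, k=k)
    # Deduplicating union; order within each list is preserved.
    seen, merged = set(), []
    for cid in dense_ids + sparse_ids:
        if cid not in seen:
            seen.add(cid)
            merged.append(cid)
    return merged
```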
| Reranker | MAP | P@1 | Inference Time (ms) |
|---|---|---|---|
| Logistic Regression (hand‑crafted features) | 0.71 | 0.58 | 1 |
| XGBoost | 0.69 | 0.55 | 2 |
| Cross‑encoder (BERT) | 0.66 | 0.51 | 15 |
| ChatGPT (gpt‑4‑turbo) | 0.58 | 0.44 | 120 |
| Qwen / Gemma / Llama | 0.55–0.57 | 0.40–0.42 | 100–130 |
- Traditional ML rerankers beat LLMs on both effectiveness and latency.
- The cross‑encoder, while better than raw retrieval, still lagged behind the lightweight models.
- LLMs added little value despite their size and incurred a 10-100× slowdown; minimal sketches of both reranker styles follow below.
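To illustrate what the winning reranker looks like, here is a minimal feature-based sketch in the spirit of the logistic-regression baseline. The three features mirror the ones named in the methodology (lexical overlap, temporal distance, file-path similarity), but the exact feature set, preprocessing, and data schema here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(issue: dict, commit: dict) -> list[float]:
    """Hand-crafted signals for one (issue, commit) pair; dict keys assumed."""
    issue_terms = set(issue["text"].lower().split())
    msg_terms = set(commit["message"].lower().split())
    # Jaccard overlap between issue text and commit message.
    overlap = len(issue_terms & msg_terms) / max(len(issue_terms | msg_terms), 1)
    # Temporal distance in days between issue creation and the commit.
    time_gap = abs(commit["timestamp"] - issue["created"]) / 86400.0
    # Fraction of changed files whose path tokens appear in the issue text.
    path_hits = sum(
        any(tok in issue_terms
            for tok in path.lower().replace("/", " ").replace(".", " ").split())
        for path in commit["files"]
    ) / max(len(commit["files"]), 1)
    return [overlap, time_gap, path_hits]

def train_reranker(train_pairs):
    """train_pairs: iterable of (issue, commit, label) tuples, label in {0, 1}."""
    X = np.array([features(i, c) for i, c, _ in train_pairs])
    y = np.array([label for _, _, label in train_pairs])
    return LogisticRegression(max_iter=1000).fit(X, y)

def rerank(clf, issue, shortlist):
    """Sort a retrieved shortlist of commits by predicted link probability."""
    probs = clf.predict_proba(np.array([features(issue, c) for c in shortlist]))[:, 1]
    return [shortlist[i] for i in np.argsort(-probs)]
```

For contrast, the LLM rerankers score the same shortlist through a prompt. The wording below is a hypothetical stand-in rather than the paper's prompt, and `llm_score` abstracts over whichever chat API (ChatGPT, Qwen, Gemma, or Llama) is queried:

```python
PROMPT = """You are linking issue reports to the commits that fix them.

Issue:
{issue}

Commit message:
{commit}

Reply with a single relevance score from 0 (unrelated) to 10 (fixes the issue)."""

def llm_rerank(issue_text, shortlist, llm_score):
    """llm_score: callable that sends a prompt to an LLM and returns a float."""
    scored = [(llm_score(PROMPT.format(issue=issue_text, commit=c["message"])), c)
              for c in shortlist]
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]
```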
Overall, the best end‑to‑end pipeline was SBERT‑HNSW retrieval + logistic‑regression reranking, achieving a MAP of 0.71 with sub‑second response times—well suited for CI/CD integration.
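For reference, the three reported metrics reduce to a few lines each. A minimal sketch, assuming each issue produces a ranked list of commit ids and a set of true link ids:

```python
def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of true links that appear in the top k of the ranking."""
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def average_precision(ranked: list, relevant: set) -> float:
    """AP for one issue; MAP is the mean of this over the test fold."""
    hits, score = 0, 0.0
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)

def precision_at_1(ranked: list, relevant: set) -> float:
    """1 if the top-ranked commit is a true link, else 0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0
```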
Practical Implications
- Tooling for CI pipelines – Teams can embed the dense‑retrieval + lightweight‑ML reranker into automated release workflows to auto‑populate issue‑commit links, reducing manual bookkeeping.
- Cost‑effective traceability – Organizations can avoid expensive API calls to proprietary LLM services and still obtain state‑of‑the‑art linking performance.
- Scalability – Approximate nearest-neighbor indexes (HNSW) scale to millions of commits with modest memory (see the back-of-envelope estimate after this list), making the approach viable for large monorepos.
- Hybrid retrieval as a safety net – Adding a BM25 pass catches edge‑case lexical matches that embeddings might miss, improving recall without a heavy penalty.
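The memory claim is easy to sanity-check. hnswlib's documentation approximates the index footprint as (d * 4 + M * 2 * 4) bytes per element for float32 vectors; a back-of-envelope sketch, assuming a 384-dimensional encoder (the numbers here are illustrative, not the paper's measurements):

```python
def hnsw_memory_gb(num_commits: int, dim: int = 384, M: int = 16) -> float:
    """Rough HNSW footprint: float32 vector storage plus base-layer links."""
    bytes_per_element = dim * 4 + M * 2 * 4
    return num_commits * bytes_per_element / 1e9

# ~8.3 GB for five million commits at dim=384 and M=16.
print(f"{hnsw_memory_gb(5_000_000):.1f} GB")
```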
Limitations & Future Work
- Dataset bias – Experiments were limited to a handful of popular open‑source projects; results may differ on proprietary codebases with different naming conventions or commit granularity.
- Feature engineering reliance – The success of traditional ML rerankers hinges on handcrafted features; automating feature discovery could further boost performance.
- LLM prompting – The study used generic prompts; task‑specific prompt engineering or fine‑tuning might narrow the gap between LLMs and classic models.
- Temporal dynamics – The current pipeline does not explicitly model issue‑commit temporal ordering beyond a simple time‑gap feature; richer temporal models could improve early‑link prediction.
Bottom line: For most development teams, a dense‑retrieval front‑end paired with a simple, well‑tuned machine‑learning reranker offers the best trade‑off between accuracy, speed, and cost when automating issue‑commit linking.
Authors
- Cole Morgan
- Muhammad Asaduzzaman
- Shaiful Chowdhury
- Shaowei Wang
Paper Information
- arXiv ID: 2605.00447v1
- Categories: cs.SE
- Published: May 1, 2026