[Paper] Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval
Source: arXiv - 2605.00447v1
Overview
Linking issue reports to the commits that fix them is a cornerstone of software traceability, yet doing it manually is tedious and error‑prone. This paper revisits a suite of classic and modern issue‑commit linking techniques, evaluates how well they retrieve and rank candidate commits, and asks whether heavyweight large language models (LLMs) actually give a measurable boost over lighter, more traditional methods.
Key Contributions
- Comprehensive benchmark of five established linking pipelines (BTLink, EasyLink, FRLink, RCLinker, Hybrid‑Linker) on a common dataset.
- Systematic comparison of retrieval back‑ends: sparse (BM25, BM25L) vs. dense (SBERT‑Semantic Search, ANNOY, LSH, HNSW).
- Reranking study that pits traditional ML models (logistic regression, gradient‑boosted trees) and a cross‑encoder against several LLMs (ChatGPT, Qwen, Gemma, Llama).
- Evidence that a hybrid of dense and sparse retrieval yields the best recall while keeping candidate sets small.
- Finding that classic ML rerankers outperform LLMs in precision, challenging the “bigger is better” assumption for this task.
Methodology
- Data preparation – The authors collected a large corpus of issue‑commit pairs from open‑source projects, splitting it into training, validation, and test folds.
- Retrieval stage – Each issue is used as a query to pull a shortlist of candidate commits (see the sketch after this list):
  - Sparse methods rely on term-frequency statistics (BM25/BM25L).
  - Dense methods embed issues and commits into a shared vector space (SBERT) and perform approximate nearest-neighbor search with ANNOY, LSH, or HNSW.
- Reranking stage – The shortlist is fed to a second model that scores each candidate:
  - Traditional ML: features such as lexical overlap, temporal distance, and file-path similarity are fed into logistic regression or XGBoost.
  - Cross-encoder: a BERT-style model that jointly encodes the issue and commit text.
  - LLM-based: prompts are sent to ChatGPT, Qwen, Gemma, and Llama; the model returns a relevance score or a binary decision.
- Evaluation metrics – Recall@k (the fraction of true links that appear in the top k), MAP, and precision@1 are reported for each pipeline.
- Efficiency measurement – Wall-clock time and memory consumption are logged to gauge scalability.
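To make the retrieval stage concrete, here is a minimal sketch of the dense path (SBERT embeddings indexed with HNSW), assuming the `sentence-transformers` and `hnswlib` packages; the model name and index parameters are illustrative choices, not the paper's reported settings.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

# Assumption: any SBERT-family encoder works here; the paper's exact model
# is not specified in this summary.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy candidate pool; in practice this is every commit message in the project.
commit_messages = [
    "Fix NPE in config loader when path is empty (closes #123)",
    "Refactor retry logic in the HTTP client",
    "Update docs for the 2.4 release",
]
commit_vecs = model.encode(commit_messages, normalize_embeddings=True)

# Build an HNSW index over the commit embeddings (cosine distance).
index = hnswlib.Index(space="cosine", dim=commit_vecs.shape[1])
index.init_index(max_elements=len(commit_messages), M=16, ef_construction=200)
index.add_items(commit_vecs, list(range(len(commit_messages))))
index.set_ef(100)  # higher ef = better recall at query time, slower queries

def retrieve(issue_text: str, k: int = 100) -> list[int]:
    """Return up to k candidate commit ids for one issue."""
    query_vec = model.encode([issue_text], normalize_embeddings=True)
    labels, _ = index.knn_query(query_vec, k=min(k, index.get_current_count()))
    return labels[0].tolist()

print(retrieve("NullPointerException when the config path is empty", k=2))
```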
Results & Findings
| Retrieval | Recall@100 | Avg. Candidates | Latency (ms/query) |
|---|---|---|---|
| BM25 | 0.62 | 1500 | 12 |
| SBERT‑HNSW | 0.78 | 300 | 8 |
| Hybrid (BM25 + SBERT) | 0.84 | 350 | 10 |
- Dense retrieval (SBERT‑HNSW) consistently outperformed sparse BM25 in recall while dramatically shrinking the candidate pool.
- Hybrid retrieval (the union of the top-k from both) gave the highest recall, suggesting the two approaches capture complementary signals; a sketch of the merge follows below.
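The union itself is only a few lines; a hedged sketch of the merge, using `rank_bm25` as a stand-in BM25 implementation and any dense retriever (such as the HNSW sketch above) for the second list:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_candidates(issue_text, commit_messages, dense_retrieve, k=100):
    """Union of BM25 top-k and dense top-k, dense hits listed first."""
    # Sparse pass: score every commit and keep the k best ids.
    bm25 = BM25Okapi([m.lower().split() for m in commit_messages])
    scores = bm25.get_scores(issue_text.lower().split())
    sparse_ids = np.argsort(-scores)[:k].tolist()
    # Dense pass: any callable returning ranked top-k commit ids.
    dense_ids = dense_retrieve(issue_text, k=k)
    # Deduplicating union; order within each list is preserved.
    seen, merged = set(), []
    for cid in dense_ids + sparse_ids:
        if cid not in seen:
            seen.add(cid)
            merged.append(cid)
    return merged
```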
| Reranker | MAP | P@1 | Inference Time (ms) |
|---|---|---|---|
| Logistic Regression (hand‑crafted features) | 0.71 | 0.58 | 1 |
| XGBoost | 0.69 | 0.55 | 2 |
| Cross‑encoder (BERT) | 0.66 | 0.51 | 15 |
| ChatGPT (gpt‑4‑turbo) | 0.58 | 0.44 | 120 |
| Qwen / Gemma / Llama | 0.55–0.57 | 0.40–0.42 | 100–130 |
- Traditional ML rerankers beat LLMs on both effectiveness and latency.
- The cross‑encoder, while better than raw retrieval, still lagged behind the lightweight models.
- LLMs added little value despite their size and incurred a 10-100× slowdown; minimal sketches of both reranker styles follow below.
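To illustrate what the winning reranker looks like, here is a minimal feature-based sketch in the spirit of the logistic-regression baseline. The three features mirror the ones named in the methodology (lexical overlap, temporal distance, file-path similarity), but the exact feature set, preprocessing, and data schema here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(issue: dict, commit: dict) -> list[float]:
    """Hand-crafted signals for one (issue, commit) pair; dict keys assumed."""
    issue_terms = set(issue["text"].lower().split())
    msg_terms = set(commit["message"].lower().split())
    # Jaccard overlap between issue text and commit message.
    overlap = len(issue_terms & msg_terms) / max(len(issue_terms | msg_terms), 1)
    # Temporal distance in days between issue creation and the commit.
    time_gap = abs(commit["timestamp"] - issue["created"]) / 86400.0
    # Fraction of changed files whose path tokens appear in the issue text.
    path_hits = sum(
        any(tok in issue_terms
            for tok in path.lower().replace("/", " ").replace(".", " ").split())
        for path in commit["files"]
    ) / max(len(commit["files"]), 1)
    return [overlap, time_gap, path_hits]

def train_reranker(train_pairs):
    """train_pairs: iterable of (issue, commit, label) tuples, label in {0, 1}."""
    X = np.array([features(i, c) for i, c, _ in train_pairs])
    y = np.array([label for _, _, label in train_pairs])
    return LogisticRegression(max_iter=1000).fit(X, y)

def rerank(clf, issue, shortlist):
    """Sort a retrieved shortlist of commits by predicted link probability."""
    probs = clf.predict_proba(np.array([features(issue, c) for c in shortlist]))[:, 1]
    return [shortlist[i] for i in np.argsort(-probs)]
```

For contrast, the LLM rerankers score the same shortlist through a prompt. The wording below is a hypothetical stand-in rather than the paper's prompt, and `llm_score` abstracts over whichever chat API (ChatGPT, Qwen, Gemma, or Llama) is queried:

```python
PROMPT = """You are linking issue reports to the commits that fix them.

Issue:
{issue}

Commit message:
{commit}

Reply with a single relevance score from 0 (unrelated) to 10 (fixes the issue)."""

def llm_rerank(issue_text, shortlist, llm_score):
    """llm_score: callable that sends a prompt to an LLM and returns a float."""
    scored = [(llm_score(PROMPT.format(issue=issue_text, commit=c["message"])), c)
              for c in shortlist]
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]
```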
Overall, the best end‑to‑end pipeline was SBERT‑HNSW retrieval + logistic‑regression reranking, achieving a MAP of 0.71 with sub‑second response times—well suited for CI/CD integration.
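For reference, the three reported metrics reduce to a few lines each. A minimal sketch, assuming each issue produces a ranked list of commit ids and a set of true link ids:

```python
def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of true links that appear in the top k of the ranking."""
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def average_precision(ranked: list, relevant: set) -> float:
    """AP for one issue; MAP is the mean of this over the test fold."""
    hits, score = 0, 0.0
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)

def precision_at_1(ranked: list, relevant: set) -> float:
    """1 if the top-ranked commit is a true link, else 0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0
```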
Practical Implications
- Tooling for CI pipelines – Teams can embed the dense‑retrieval + lightweight‑ML reranker into automated release workflows to auto‑populate issue‑commit links, reducing manual bookkeeping.
- Cost‑effective traceability – Organizations can avoid expensive API calls to proprietary LLM services and still obtain state‑of‑the‑art linking performance.
- Scalability – Approximate nearest-neighbor indexes (HNSW) scale to millions of commits with modest memory (see the back-of-envelope estimate after this list), making the approach viable for large monorepos.
- Hybrid retrieval as a safety net – Adding a BM25 pass catches edge‑case lexical matches that embeddings might miss, improving recall without a heavy penalty.
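The memory claim is easy to sanity-check. hnswlib's documentation approximates the index footprint as (d * 4 + M * 2 * 4) bytes per element for float32 vectors; a back-of-envelope sketch, assuming a 384-dimensional encoder (the numbers here are illustrative, not the paper's measurements):

```python
def hnsw_memory_gb(num_commits: int, dim: int = 384, M: int = 16) -> float:
    """Rough HNSW footprint: float32 vector storage plus base-layer links."""
    bytes_per_element = dim * 4 + M * 2 * 4
    return num_commits * bytes_per_element / 1e9

# ~8.3 GB for five million commits at dim=384 and M=16.
print(f"{hnsw_memory_gb(5_000_000):.1f} GB")
```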
Limitations & Future Work
- Dataset bias – Experiments were limited to a handful of popular open‑source projects; results may differ on proprietary codebases with different naming conventions or commit granularity.
- Feature engineering reliance – The success of traditional ML rerankers hinges on handcrafted features; automating feature discovery could further boost performance.
- LLM prompting – The study used generic prompts; task‑specific prompt engineering or fine‑tuning might narrow the gap between LLMs and classic models.
- Temporal dynamics – The current pipeline does not explicitly model issue‑commit temporal ordering beyond a simple time‑gap feature; richer temporal models could improve early‑link prediction.
Bottom line: For most development teams, a dense‑retrieval front‑end paired with a simple, well‑tuned machine‑learning reranker offers the best trade‑off between accuracy, speed, and cost when automating issue‑commit linking.
Authors
- Cole Morgan
- Muhammad Asaduzzaman
- Shaiful Chowdhury
- Shaowei Wang
Paper Information
- arXiv ID: 2605.00447v1
- Categories: cs.SE
- Published: May 1, 2026