[Paper] Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval

Published: May 1, 2026 at 02:34 AM EDT
5 min read

Source: arXiv - 2605.00447v1

Overview

Linking issue reports to the commits that fix them is a cornerstone of software traceability, yet doing it manually is tedious and error‑prone. This paper revisits a suite of classic and modern issue‑commit linking techniques, evaluates how well they retrieve and rank candidate commits, and asks whether heavyweight large language models (LLMs) actually give a measurable boost over lighter, more traditional methods.

Key Contributions

  • Comprehensive benchmark of five established linking pipelines (BTLink, EasyLink, FRLink, RCLinker, Hybrid‑Linker) on a common dataset.
  • Systematic comparison of retrieval back‑ends: sparse (BM25, BM25L) vs. dense (SBERT‑Semantic Search, ANNOY, LSH, HNSW).
  • Reranking study that pits traditional ML models (logistic regression, gradient‑boosted trees) and a cross‑encoder against several LLMs (ChatGPT, Qwen, Gemma, Llama).
  • Evidence that a hybrid of dense and sparse retrieval yields the best recall while keeping candidate sets small.
  • Finding that classic ML rerankers outperform LLMs in precision, challenging the “bigger is better” assumption for this task.

Methodology

  1. Data preparation – The authors collected a large corpus of issue‑commit pairs from open‑source projects, splitting it into training, validation, and test folds.
  2. Retrieval stage – Each issue is used as a query to pull a shortlist of candidate commits (a sketch of these back‑ends follows this list).
    • Sparse methods rely on term‑frequency statistics (BM25/BM25L).
    • Dense methods embed issues and commits into a vector space (SBERT) and perform approximate nearest‑neighbor search with ANNOY, LSH, or HNSW.
  3. Reranking stage – The shortlist is fed to a second model that scores each candidate:
    • Traditional ML: features such as lexical overlap, temporal distance, and file‑path similarity fed into logistic regression or XGBoost.
    • Cross‑encoder: a BERT‑style model that jointly encodes issue and commit text.
    • LLM‑based: prompts are sent to ChatGPT, Qwen, Gemma, and Llama; the model returns a relevance score or binary decision.
  4. Evaluation metrics – Recall@k (how many true links appear in the top‑k), MAP, and precision@1 are reported for each pipeline (a second sketch after the list illustrates these metrics).
  5. Efficiency measurement – Wall‑clock time and memory consumption are logged to gauge scalability.
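
To make the retrieval stage concrete, here is a minimal sketch of the sparse, dense, and hybrid back‑ends, assuming the rank_bm25 and sentence-transformers packages; the commit messages, issue text, and top‑k sizes are illustrative, not the paper's data or exact tooling.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Toy corpus of commit messages (illustrative only).
commit_msgs = [
    "Fix null pointer dereference in parser",
    "Add retry logic to HTTP client",
    "Refactor config loading",
]
issue_text = "App crashes with null pointer when parsing empty input"

# Sparse retrieval: BM25 over whitespace-tokenized commit messages.
bm25 = BM25Okapi([m.lower().split() for m in commit_msgs])
sparse_scores = bm25.get_scores(issue_text.lower().split())
sparse_top = sorted(range(len(commit_msgs)), key=lambda i: -sparse_scores[i])[:2]

# Dense retrieval: SBERT embeddings + cosine-similarity semantic search.
model = SentenceTransformer("all-MiniLM-L6-v2")
commit_emb = model.encode(commit_msgs, convert_to_tensor=True)
issue_emb = model.encode(issue_text, convert_to_tensor=True)
dense_top = [h["corpus_id"]
             for h in util.semantic_search(issue_emb, commit_emb, top_k=2)[0]]

# Hybrid retrieval: union of the two shortlists, as in the paper's best setup.
candidates = set(sparse_top) | set(dense_top)
print([commit_msgs[i] for i in candidates])
```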
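
And a small sketch of the three reported metrics; `ranked` and `truth` are hypothetical structures mapping each issue to its ranked commit candidates and its true link(s).

```python
def recall_at_k(ranked, truth, k):
    """Fraction of issues with at least one true commit in the top-k."""
    return sum(1 for q in truth if truth[q] & set(ranked[q][:k])) / len(truth)

def precision_at_1(ranked, truth):
    """Fraction of issues whose top-ranked commit is a true link."""
    return sum(1 for q in truth if ranked[q][0] in truth[q]) / len(truth)

def mean_average_precision(ranked, truth):
    """Mean over issues of the average precision at each true-link rank."""
    ap_total = 0.0
    for q, relevant in truth.items():
        hits, precisions = 0, []
        for rank, commit in enumerate(ranked[q], start=1):
            if commit in relevant:
                hits += 1
                precisions.append(hits / rank)
        ap_total += sum(precisions) / max(len(relevant), 1)
    return ap_total / len(truth)

# One issue, its ranked candidates, and its single true link.
ranked = {"ISSUE-1": ["c3", "c1", "c9"]}
truth = {"ISSUE-1": {"c1"}}
print(recall_at_k(ranked, truth, 2), precision_at_1(ranked, truth),
      mean_average_precision(ranked, truth))  # -> 1.0 0.0 0.5
```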

Results & Findings

Retrieval | Recall@100 | Avg. Candidates | Speed (ms/query)
--- | --- | --- | ---
BM25 | 0.62 | 1500 | 12
SBERT‑HNSW | 0.78 | 300 | 8
Hybrid (BM25 + SBERT) | 0.84 | 350 | 10

  • Dense retrieval (SBERT‑HNSW) consistently outperformed sparse BM25 in recall while dramatically shrinking the candidate pool.
  • Hybrid retrieval (union of top‑k from both) gave the highest recall, suggesting the two approaches capture complementary signals.

Reranker | MAP | P@1 | Inference Time (ms)
--- | --- | --- | ---
Logistic Regression (hand‑crafted features) | 0.71 | 0.58 | 1
XGBoost | 0.69 | 0.55 | 2
Cross‑encoder (BERT) | 0.66 | 0.51 | 15
ChatGPT (gpt‑4‑turbo) | 0.58 | 0.44 | 120
Qwen / Gemma / Llama | 0.55‑0.57 | 0.40‑0.42 | 100‑130

  • Traditional ML rerankers beat LLMs on both effectiveness and latency.
  • The cross‑encoder, while better than raw retrieval, still lagged behind the lightweight models.
  • LLMs added little value despite their size, and incurred a 10‑100× slowdown.

Overall, the best end‑to‑end pipeline was SBERT‑HNSW retrieval + logistic‑regression reranking, achieving a MAP of 0.71 with sub‑second response times, which makes it well suited for CI/CD integration. The sketch below illustrates this reranking recipe.
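
This is a minimal sketch of that winning recipe, assuming scikit-learn and three illustrative hand‑crafted features (lexical overlap, temporal distance, file‑path presence); the feature definitions and toy data are assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(issue, commit):
    issue_terms = set(issue["text"].lower().split())
    commit_terms = set(commit["msg"].lower().split())
    # Lexical overlap between issue and commit-message vocabularies.
    overlap = len(issue_terms & commit_terms) / max(len(issue_terms), 1)
    # Temporal distance in days between issue report and commit.
    time_gap = abs(commit["timestamp"] - issue["timestamp"]) / 86400.0
    # Crude file-path signal: does any changed path appear in the issue text?
    path_hit = float(any(p in issue["text"] for p in commit["paths"]))
    return [overlap, time_gap, path_hit]

# Toy training pairs: X holds feature vectors, y marks true links.
X = np.array([[0.4, 1.0, 1.0], [0.05, 30.0, 0.0], [0.3, 2.0, 0.0]])
y = np.array([1, 0, 1])
clf = LogisticRegression().fit(X, y)

# Rerank a retrieved shortlist by predicted link probability.
issue = {"text": "crash in src/config/loader.py on empty file",
         "timestamp": 1_700_000_000}
shortlist = [
    {"msg": "Fix crash on empty file", "timestamp": 1_700_050_000,
     "paths": ["src/config/loader.py"]},
    {"msg": "Update docs", "timestamp": 1_700_000_500,
     "paths": ["README.md"]},
]
scores = clf.predict_proba([features(issue, c) for c in shortlist])[:, 1]
ranked = [c["msg"] for _, c in sorted(zip(scores, shortlist),
                                      key=lambda t: -t[0])]
print(ranked)
```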

Practical Implications

  • Tooling for CI pipelines – Teams can embed the dense‑retrieval + lightweight‑ML reranker into automated release workflows to auto‑populate issue‑commit links, reducing manual bookkeeping.
  • Cost‑effective traceability – Organizations can avoid expensive API calls to proprietary LLM services and still obtain state‑of‑the‑art linking performance.
  • Scalability – Approximate nearest‑neighbor indexes (HNSW) scale to millions of commits with modest memory, making the approach viable for large monorepos (see the indexing sketch after this list).
  • Hybrid retrieval as a safety net – Adding a BM25 pass catches edge‑case lexical matches that embeddings might miss, improving recall without a heavy penalty.
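
As a reference point, here is a minimal sketch of building and querying such an index with the hnswlib package (an assumption; the paper does not name its exact ANN tooling), with illustrative dimensions and parameters.

```python
import hnswlib
import numpy as np

dim, num_commits = 384, 100_000        # e.g. MiniLM-sized embeddings
embeddings = np.random.rand(num_commits, dim).astype(np.float32)

# Build the index; ef_construction and M trade build time and memory
# against recall.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_commits, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(num_commits))
index.set_ef(100)                      # query-time speed/accuracy knob

# Retrieve the 100 nearest commits for a single issue embedding.
issue_vec = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(issue_vec, k=100)
```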

Limitations & Future Work

  • Dataset bias – Experiments were limited to a handful of popular open‑source projects; results may differ on proprietary codebases with different naming conventions or commit granularity.
  • Feature engineering reliance – The success of traditional ML rerankers hinges on handcrafted features; automating feature discovery could further boost performance.
  • LLM prompting – The study used generic prompts; task‑specific prompt engineering or fine‑tuning might narrow the gap between LLMs and classic models.
  • Temporal dynamics – The current pipeline does not explicitly model issue‑commit temporal ordering beyond a simple time‑gap feature; richer temporal models could improve early‑link prediction.

Bottom line: For most development teams, a dense‑retrieval front‑end paired with a simple, well‑tuned machine‑learning reranker offers the best trade‑off between accuracy, speed, and cost when automating issue‑commit linking.

Authors

  • Cole Morgan
  • Muhammad Asaduzzaman
  • Shaiful Chowdhury
  • Shaowei Wang

Paper Information

  • arXiv ID: 2605.00447v1
  • Categories: cs.SE
  • Published: May 1, 2026