[Paper] SweRank+: Multilingual, Multi-Turn Code Ranking for Software Issue Localization

Published: December 23, 2025 at 11:18 AM EST
4 min read
Source: arXiv - 2512.20482v1

Overview

SweRank+ tackles a pain point for anyone working with large, multilingual codebases: turning a natural‑language bug report or feature request into the exact function(s) that need fixing. By combining a cross‑lingual code ranking engine (SweRankMulti) with an iterative “agentic” search loop (SweRankAgent), the authors push issue‑localization accuracy well beyond the current state of the art—especially for languages beyond Python.

Key Contributions

  • SweRankMulti: a two‑stage ranking pipeline (dense retriever + LLM‑based listwise reranker) trained on a massive, multilingual issue‑localization dataset covering the most popular programming languages.
  • SweRankAgent: an agentic search framework that performs multi‑turn reasoning, storing intermediate candidates in a memory buffer and refining the search over several iterations.
  • Large‑scale multilingual dataset: curated from real‑world issue trackers, providing high‑quality training signals for 10+ languages (Python, Java, JavaScript, Go, C#, etc.).
  • State‑of‑the‑art results: SweRankMulti alone beats prior single‑pass baselines on all benchmark languages; SweRankAgent delivers a further 3–7 % boost in top‑k accuracy.
  • Open‑source release: code, models, and the dataset are publicly available, enabling reproducibility and downstream tooling.

Methodology

  1. Embedding Retriever – Each function in the repository is encoded with a language‑agnostic code encoder (a transformer fine‑tuned on code‑comment pairs). The bug description is encoded the same way, and a fast approximate nearest‑neighbor search returns the top‑N candidate functions.

  2. Listwise LLM Reranker – The N candidates, together with the original issue text, are fed to a large language model (LLM) that scores the entire list jointly, allowing the model to consider cross‑candidate interactions (e.g., “if function A is relevant, function B is less likely”).

  3. Agentic Search Loop (SweRankAgent) – Instead of stopping after one pass, the system keeps a memory buffer of previously examined candidates. At each turn, the agent:

    • Queries the retriever with an updated prompt that incorporates insights from the buffer (e.g., “the issue mentions a null pointer; prioritize functions handling pointers”).
    • Reranks the new batch with the LLM.
    • Updates the buffer with the highest‑scoring candidates.

    The loop runs for a fixed number of turns (typically 3–5) or until convergence, effectively performing a coarse‑to‑fine search; the full loop is sketched below.
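
Putting the three steps together, the loop can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' released interface: `embed`, `ann_index`, `llm_rerank`, and `refine_query` are hypothetical stand-ins for the retriever encoder, ANN index, listwise reranker, and prompt-refinement step.

```python
# Illustrative sketch of the retrieve-rerank-iterate loop described above.
# `embed`, `ann_index.search`, `llm_rerank`, and `refine_query` are
# hypothetical stand-ins, not the paper's released API.

def localize(issue_text, ann_index, embed, llm_rerank, refine_query,
             n_candidates=50, max_turns=4):
    buffer = {}          # memory buffer: func_id -> (snippet, score)
    query = issue_text   # query prompt, refined across turns
    top = []

    for _ in range(max_turns):
        # Stage 1: dense retrieval over precomputed function embeddings.
        candidates = ann_index.search(embed(query), k=n_candidates)

        # Stage 2: the LLM scores the whole candidate list jointly,
        # conditioned on the original issue text.
        scored = llm_rerank(issue_text, candidates)  # [(func_id, snippet, score)]

        # Keep the best score seen so far for each candidate.
        for func_id, snippet, score in scored:
            if func_id not in buffer or score > buffer[func_id][1]:
                buffer[func_id] = (snippet, score)

        # Fold insights from the buffer back into the query prompt;
        # stop early once the refined query stabilizes (convergence).
        top = sorted(buffer.items(), key=lambda kv: -kv[1][1])[:10]
        new_query = refine_query(issue_text, top)
        if new_query == query:
            break
        query = new_query

    return [func_id for func_id, _ in top]
```

The buffer plays the coarse‑to‑fine role described above: each turn narrows the query using the strongest candidates found so far.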

All components are trained end‑to‑end on the multilingual dataset, with the retriever optimized via contrastive loss and the reranker via listwise cross‑entropy.
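
Both objectives are standard; a minimal PyTorch sketch, assuming in‑batch negatives for the contrastive loss and raw reranker scores over each candidate list, might look like this (a generic reconstruction, not the paper's training code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(issue_emb, func_emb, temperature=0.05):
    """InfoNCE-style contrastive loss for the retriever.

    issue_emb, func_emb: (batch, dim) L2-normalized embeddings, where
    func_emb[i] is the gold function for issue_emb[i]; the other
    functions in the batch serve as in-batch negatives.
    """
    logits = issue_emb @ func_emb.T / temperature          # (batch, batch)
    targets = torch.arange(issue_emb.size(0), device=issue_emb.device)
    return F.cross_entropy(logits, targets)                # diagonal = positives

def listwise_ce_loss(candidate_scores, gold_index):
    """Listwise cross-entropy for the reranker.

    candidate_scores: (batch, n_candidates) raw scores the reranker
    assigns to each candidate list; gold_index: (batch,) position of
    the gold function within each list.
    """
    return F.cross_entropy(candidate_scores, gold_index)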

Results & Findings

| Benchmark | Language | Top‑1 Acc. (Prev. SOTA) | Top‑1 Acc. (SweRankMulti) | Top‑1 Acc. (SweRankAgent) |
| --- | --- | --- | --- | --- |
| Defects4J | Java | 58.2 % | 66.7 % | 71.3 % |
| BugsJS | JavaScript | 49.5 % | 57.9 % | 62.4 % |
| GoBugs | Go | 45.1 % | 53.2 % | 58.0 % |
| Multi‑Lang | 10 languages | 52.3 % (average) | 60.8 % | 65.5 % |

  • Cross‑lingual transfer: Training on the combined dataset improves low‑resource languages (e.g., Rust, Kotlin) by >10 % absolute.
  • Multi‑turn gains: The agentic loop consistently adds 3–7 % in top‑k accuracy, confirming that iterative reasoning helps resolve ambiguous or noisy issue descriptions.
  • Efficiency: Despite the extra turns, average latency remains under 1 s per query on a single GPU, thanks to the fast ANN retriever and batched LLM inference.

Practical Implications

  • Faster triage: Integrating SweRank+ into CI/CD pipelines can auto‑suggest the exact function(s) to inspect when a new issue lands, cutting manual search time dramatically (a hypothetical hook is sketched after this list).
  • Cross‑language codebases: Companies with polyglot stacks (e.g., microservices in Java, Go, and Node.js) can use a single model rather than maintaining language‑specific tools.
  • Developer assistants: IDE plugins could surface ranked function candidates as you type an error description, turning natural‑language debugging into a guided code navigation experience.
  • Security & compliance: When a vulnerability is reported in a high‑level description, SweRank+ can quickly pinpoint the affected code paths across languages, accelerating patch deployment.
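
To make the triage idea concrete, a purely hypothetical CI hook might look like the following; `localize` and the candidate fields are invented for illustration and are not the released project's API.

```python
# Hypothetical CI hook: when a new issue lands, suggest functions to inspect.
# `localize` is an invented callable standing in for whatever entry point the
# released SweRank+ code exposes; the candidate fields are likewise illustrative.

def on_issue_opened(issue_title, issue_body, repo_path, localize, top_k=5):
    candidates = localize(
        query=f"{issue_title}\n\n{issue_body}",  # issue text as the query
        repo=repo_path,
        top_k=top_k,
    )
    # Format the suggestions as a comment to post back on the issue.
    lines = [
        f"- `{c['path']}:{c['function']}` (score {c['score']:.2f})"
        for c in candidates
    ]
    return "Likely locations to inspect:\n" + "\n".join(lines)
```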

Limitations & Future Work

  • Dependency on high‑quality issue data: The model’s performance drops when issue descriptions are extremely terse or lack domain terminology.
  • Scalability to massive monorepos: While the ANN index scales well, memory consumption for the agentic buffer grows with repository size; smarter pruning strategies are needed.
  • LLM cost: The listwise reranker relies on a large LLM, which may be prohibitive for on‑premise deployments without quantization or distillation.
  • Future directions: The authors plan to explore (1) few‑shot adaptation for proprietary codebases, (2) tighter integration with static analysis tools to enrich the candidate set, and (3) open‑source lighter‑weight rerankers that retain most of the accuracy gains.

Authors

  • Revanth Gangi Reddy
  • Ye Liu
  • Wenting Zhao
  • JaeHyeok Doo
  • Tarun Suresh
  • Daniel Lee
  • Caiming Xiong
  • Yingbo Zhou
  • Semih Yavuz
  • Shafiq Joty

Paper Information

  • arXiv ID: 2512.20482v1
  • Categories: cs.SE, cs.AI
  • Published: December 23, 2025