[Paper] SweRank+: Multilingual, Multi-Turn Code Ranking for Software Issue Localization

Published: December 23, 2025 at 11:18 AM EST
4 min read
Source: arXiv - 2512.20482v1

Overview

SweRank+ tackles a pain point for anyone working with large, multilingual codebases: turning a natural‑language bug report or feature request into the exact function(s) that need fixing. By combining a cross‑lingual code ranking engine (SweRankMulti) with an iterative “agentic” search loop (SweRankAgent), the authors push issue‑localization accuracy well beyond the current state of the art—especially for languages beyond Python.

Key Contributions

  • SweRankMulti: a two‑stage ranking pipeline (dense retriever + LLM‑based listwise reranker) trained on a massive, multilingual issue‑localization dataset covering the most popular programming languages.
  • SweRankAgent: an agentic search framework that performs multi‑turn reasoning, storing intermediate candidates in a memory buffer and refining the search over several iterations.
  • Large‑scale multilingual dataset: curated from real‑world issue trackers, providing high‑quality training signals for 10+ languages (Python, Java, JavaScript, Go, C#, etc.).
  • State‑of‑the‑art results: SweRankMulti alone beats prior single‑pass baselines on all benchmark languages; SweRankAgent delivers a further 3–7 % boost in top‑k accuracy.
  • Open‑source release: code, models, and the dataset are publicly available, enabling reproducibility and downstream tooling.

Methodology

  1. Embedding Retriever – Each function in the repository is encoded with a language‑agnostic code encoder (a transformer fine‑tuned on code‑comment pairs). The bug description is encoded the same way, and a fast approximate nearest‑neighbor search returns the top‑N candidate functions.

  2. Listwise LLM Reranker – The N candidates, together with the original issue text, are fed to a large language model (LLM) that scores the entire list jointly, allowing the model to consider cross‑candidate interactions (e.g., “if function A is relevant, function B is less likely”).

  3. Agentic Search Loop (SweRankAgent) – Instead of stopping after one pass, the system keeps a memory buffer of previously examined candidates. At each turn, the agent:

    • Queries the retriever with an updated prompt that incorporates insights from the buffer (e.g., “the issue mentions a null pointer; prioritize functions handling pointers”).
    • Reranks the new batch with the LLM.
    • Updates the buffer with the highest‑scoring candidates.

    The loop runs for a fixed number of turns (typically 3–5) or until convergence, effectively performing a coarse‑to‑fine search; the full loop is sketched below.
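
Putting the three steps together, the loop can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' released interface: `embed`, `ann_index`, `llm_rerank`, and `refine_query` are hypothetical stand-ins for the retriever encoder, ANN index, listwise reranker, and prompt-refinement step.

```python
# Illustrative sketch of the retrieve-rerank-iterate loop described above.
# `embed`, `ann_index.search`, `llm_rerank`, and `refine_query` are
# hypothetical stand-ins, not the paper's released API.

def localize(issue_text, ann_index, embed, llm_rerank, refine_query,
             n_candidates=50, max_turns=4):
    buffer = {}          # memory buffer: func_id -> (snippet, score)
    query = issue_text   # query prompt, refined across turns
    top = []

    for _ in range(max_turns):
        # Stage 1: dense retrieval over precomputed function embeddings.
        candidates = ann_index.search(embed(query), k=n_candidates)

        # Stage 2: the LLM scores the whole candidate list jointly,
        # conditioned on the original issue text.
        scored = llm_rerank(issue_text, candidates)  # [(func_id, snippet, score)]

        # Keep the best score seen so far for each candidate.
        for func_id, snippet, score in scored:
            if func_id not in buffer or score > buffer[func_id][1]:
                buffer[func_id] = (snippet, score)

        # Fold insights from the buffer back into the query prompt;
        # stop early once the refined query stabilizes (convergence).
        top = sorted(buffer.items(), key=lambda kv: -kv[1][1])[:10]
        new_query = refine_query(issue_text, top)
        if new_query == query:
            break
        query = new_query

    return [func_id for func_id, _ in top]
```

The buffer plays the coarse‑to‑fine role described above: each turn narrows the query using the strongest candidates found so far.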

All components are trained end‑to‑end on the multilingual dataset, with the retriever optimized via contrastive loss and the reranker via listwise cross‑entropy.
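
Both objectives are standard; a minimal PyTorch sketch, assuming in‑batch negatives for the contrastive loss and raw reranker scores over each candidate list, might look like this (a generic reconstruction, not the paper's training code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(issue_emb, func_emb, temperature=0.05):
    """InfoNCE-style contrastive loss for the retriever.

    issue_emb, func_emb: (batch, dim) L2-normalized embeddings, where
    func_emb[i] is the gold function for issue_emb[i]; the other
    functions in the batch serve as in-batch negatives.
    """
    logits = issue_emb @ func_emb.T / temperature          # (batch, batch)
    targets = torch.arange(issue_emb.size(0), device=issue_emb.device)
    return F.cross_entropy(logits, targets)                # diagonal = positives

def listwise_ce_loss(candidate_scores, gold_index):
    """Listwise cross-entropy for the reranker.

    candidate_scores: (batch, n_candidates) raw scores the reranker
    assigns to each candidate list; gold_index: (batch,) position of
    the gold function within each list.
    """
    return F.cross_entropy(candidate_scores, gold_index)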

Results & Findings

| Benchmark | Language | Top‑1 Acc. (Prev. SOTA) | Top‑1 Acc. (SweRankMulti) | Top‑1 Acc. (SweRankAgent) |
| --- | --- | --- | --- | --- |
| Defects4J | Java | 58.2 % | 66.7 % | 71.3 % |
| BugsJS | JavaScript | 49.5 % | 57.9 % | 62.4 % |
| GoBugs | Go | 45.1 % | 53.2 % | 58.0 % |
| Multi‑Lang | 10 languages | 52.3 % (average) | 60.8 % | 65.5 % |

  • Cross‑lingual transfer: Training on the combined dataset improves low‑resource languages (e.g., Rust, Kotlin) by >10 % absolute.
  • Multi‑turn gains: The agentic loop consistently adds 3–7 % in top‑k accuracy, confirming that iterative reasoning helps resolve ambiguous or noisy issue descriptions.
  • Efficiency: Despite the extra turns, average latency remains under 1 s per query on a single GPU, thanks to the fast ANN retriever and batched LLM inference.

Practical Implications

  • Faster triage: Integrating SweRank+ into CI/CD pipelines can auto‑suggest the exact function(s) to inspect when a new issue lands, cutting manual search time dramatically (a hypothetical hook is sketched after this list).
  • Cross‑language codebases: Companies with polyglot stacks (e.g., microservices in Java, Go, and Node.js) can use a single model rather than maintaining language‑specific tools.
  • Developer assistants: IDE plugins could surface ranked function candidates as you type an error description, turning natural‑language debugging into a guided code navigation experience.
  • Security & compliance: When a vulnerability is reported in a high‑level description, SweRank+ can quickly pinpoint the affected code paths across languages, accelerating patch deployment.
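
To make the triage idea concrete, a purely hypothetical CI hook might look like the following; `localize` and the candidate fields are invented for illustration and are not the released project's API.

```python
# Hypothetical CI hook: when a new issue lands, suggest functions to inspect.
# `localize` is an invented callable standing in for whatever entry point the
# released SweRank+ code exposes; the candidate fields are likewise illustrative.

def on_issue_opened(issue_title, issue_body, repo_path, localize, top_k=5):
    candidates = localize(
        query=f"{issue_title}\n\n{issue_body}",  # issue text as the query
        repo=repo_path,
        top_k=top_k,
    )
    # Format the suggestions as a comment to post back on the issue.
    lines = [
        f"- `{c['path']}:{c['function']}` (score {c['score']:.2f})"
        for c in candidates
    ]
    return "Likely locations to inspect:\n" + "\n".join(lines)
```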

Limitations & Future Work

  • Dependency on high‑quality issue data: The model’s performance drops when issue descriptions are extremely terse or lack domain terminology.
  • Scalability to massive monorepos: While the ANN index scales well, memory consumption for the agentic buffer grows with repository size; smarter pruning strategies are needed.
  • LLM cost: The listwise reranker relies on a large LLM, which may be prohibitive for on‑premise deployments without quantization or distillation.
  • Future directions: The authors plan to explore (1) few‑shot adaptation for proprietary codebases, (2) tighter integration with static analysis tools to enrich the candidate set, and (3) open‑source lighter‑weight rerankers that retain most of the accuracy gains.

Authors

  • Revanth Gangi Reddy
  • Ye Liu
  • Wenting Zhao
  • JaeHyeok Doo
  • Tarun Suresh
  • Daniel Lee
  • Caiming Xiong
  • Yingbo Zhou
  • Semih Yavuz
  • Shafiq Joty

Paper Information

  • arXiv ID: 2512.20482v1
  • Categories: cs.SE, cs.AI
  • Published: December 23, 2025