[Paper] Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures

Published: December 5, 2025 at 12:42 PM EST
4 min read
Source: arXiv - 2512.05908v1

Overview

The paper tackles one of the most painful problems in modern microservice environments: finding the exact piece of code that causes a bug when the system spans dozens of repositories. By converting code into layered natural‑language summaries, the authors turn bug localization into a pure “text‑to‑text” search problem that large language models (LLMs) can handle more efficiently than traditional code‑centric techniques.

Key Contributions

  • Hierarchical NL Summaries: Automatic generation of concise natural‑language descriptions for every file, directory, and repository in a microservice codebase.
  • Two‑Phase NL‑to‑NL Search:
    1. Repository routing – quickly narrows the search space to the most relevant repo(s).
    2. Top‑down localization – drills from repository → directory → file using the same NL query.
  • Scalable Evaluation: Tested on DNext, an industrial system with 46 repositories and ~1.1 M LOC, achieving Pass@10 = 0.82 and MRR = 0.50, far surpassing classic IR baselines and agentic RAG tools such as GitHub Copilot and Cursor.
  • Interpretability: The search path (repo → dir → file) is exposed as plain text, giving developers a transparent view of why a particular location was suggested.
  • LLM‑Friendly Design: By staying within the LLM’s token window (pure NL), the approach sidesteps context‑length limits that cripple raw‑code retrieval.

Methodology

  1. Code Summarization
    • A fine‑tuned LLM (e.g., GPT‑4‑Turbo) ingests each source file and produces a short, human‑readable description (e.g., “Handles user authentication via JWT”).
    • Summaries are aggregated upward: directory summaries are synthesized from their files, and repository summaries from their directories.
  2. Index Construction
    • All summaries are stored in a vector store (e.g., FAISS) together with their hierarchical identifiers.
  3. Two‑Phase Retrieval
    • Phase 1 – Repository Routing: The bug report (natural language) is embedded and matched against repository‑level summaries. The top‑k repositories are selected.
    • Phase 2 – Top‑Down Localization: Within each selected repo, the same query is matched against directory summaries, then file summaries, yielding a ranked list of candidate files.
  4. Scoring & Ranking
    • Cosine similarity between query and summary embeddings provides the primary score; a lightweight re‑ranking step incorporates metadata (e.g., recent commit activity). Code sketches of the pipeline follow this list.
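
To make steps 1–2 concrete, here is a minimal sketch of the hierarchical summarization and index construction. Everything below is illustrative, not the authors' implementation: it assumes an OpenAI-compatible chat API, sentence-transformers embeddings, FAISS as the vector store, and (for brevity) a single level of directory aggregation.

```python
# Minimal sketch of steps 1-2: hierarchical summarization + index construction.
from pathlib import Path

import faiss                      # pip install faiss-cpu
import numpy as np
from openai import OpenAI         # pip install openai
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def summarize(text: str, level: str) -> str:
    """Ask the LLM for a one-sentence description of a file/directory/repo."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # the paper mentions a fine-tuned LLM; this model is an assumption
        messages=[{"role": "user",
                   "content": f"Summarize this {level} in one sentence:\n\n{text[:8000]}"}],
    )
    return resp.choices[0].message.content.strip()

def summarize_repo(repo: Path) -> dict[str, str]:
    """Bottom-up aggregation: file summaries -> directory summaries -> repo summary."""
    summaries: dict[str, str] = {}
    files = sorted(repo.rglob("*.java"))               # DNext is Java-centric
    for f in files:
        summaries[str(f)] = summarize(f.read_text(errors="ignore"), "source file")
    for d in sorted({f.parent for f in files}):        # one directory level, for brevity
        children = "\n".join(summaries[str(f)] for f in files if f.parent == d)
        summaries[str(d)] = summarize(children, "directory")
    dir_summaries = "\n".join(s for p, s in summaries.items() if Path(p).is_dir())
    summaries[str(repo)] = summarize(dir_summaries, "repository")
    return summaries

def build_index(summaries: dict[str, str]):
    """Embed summaries (normalized, so inner product = cosine) into FAISS."""
    ids = list(summaries)
    vecs = embedder.encode([summaries[i] for i in ids], normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index, ids
```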

The whole pipeline stays within a single NL‑to‑NL pass, avoiding the need for cross‑modal embeddings (code ↔ text) that are typically noisy and heavyweight.
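
A matching sketch of steps 3–4, the two-phase retrieval plus re-ranking. The phase names come from the paper; the per-level `(index, ids)` pairs, the `recent_commit_score` helper, and the 0.9/0.1 blend weights are assumptions, and `embedder` is reused from the sketch above.

```python
# Sketch of steps 3-4: two-phase NL-to-NL retrieval with lightweight re-ranking.

def recent_commit_score(path: str) -> float:
    """Hypothetical metadata signal: higher for recently changed files."""
    return 0.0  # placeholder; a real system might read `git log` timestamps

def top_k(index, ids, query_vec, k, scope=None):
    """Cosine search over one hierarchy level, optionally restricted to `scope`."""
    scores, idx = index.search(query_vec[None, :].astype("float32"), len(ids))
    hits = [(ids[i], float(s)) for s, i in zip(scores[0], idx[0])
            if scope is None or ids[i].startswith(scope)]
    return hits[:k]

def localize(bug_report, repo_idx, dir_idx, file_idx, k_repos=3, k_files=10):
    """Return a ranked list of (file_path, score) candidates for a bug report."""
    q = embedder.encode([bug_report], normalize_embeddings=True)[0]
    candidates = []
    # Phase 1 - repository routing: match the report against repo summaries.
    for repo, _ in top_k(*repo_idx, q, k_repos):
        # Phase 2 - top-down localization: directory summaries, then files.
        for directory, _ in top_k(*dir_idx, q, 5, scope=repo):  # 5 dirs/repo is arbitrary
            candidates += top_k(*file_idx, q, k_files, scope=directory)
    # Blend cosine similarity with commit-recency metadata for the final order.
    ranked = sorted(candidates,
                    key=lambda c: 0.9 * c[1] + 0.1 * recent_commit_score(c[0]),
                    reverse=True)
    return ranked[:k_files]
```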

Results & Findings

| Metric | Proposed NL‑Summaries | Traditional IR | Copilot‑RAG | Cursor‑RAG |
|---|---|---|---|---|
| Pass@10 | 0.82 | 0.41 | 0.53 | 0.48 |
| MRR (Mean Reciprocal Rank) | 0.50 | 0.22 | 0.31 | 0.28 |

  • Higher recall: The method finds the correct file in the top‑10 results 82% of the time, twice the traditional IR baseline (0.41).
  • Better ranking: An MRR of 0.50 means the reciprocal rank of the first correct file averages 0.5; roughly speaking, the correct file often lands within the top two results.
  • Token efficiency: Summaries average ~30 tokens per file, keeping the total index well within LLM context limits even for large codebases.
  • Interpretability: Developers can read the intermediate directory and repository summaries to understand the reasoning chain, which was not possible with black‑box RAG outputs.
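
For reference, the two headline metrics follow their standard definitions (stated here for convenience; these are textbook formulas, not reproduced from the paper):

```latex
% Pass@k: fraction of bug reports whose faulty file appears in the top-k results.
\mathrm{Pass@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbf{1}\left[\mathrm{rank}_i \le k\right]

% MRR: mean reciprocal rank of the first correct file across all queries.
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

where rank_i is the position of the first correct file for bug report i and |Q| is the number of evaluated reports.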

Practical Implications

  • Faster Debugging: Teams can plug the summarization pipeline into existing issue‑tracking tools (Jira, GitHub Issues) to get immediate, ranked file suggestions.
  • Enterprise AI Trust: The transparent hierarchy (repo → dir → file) satisfies compliance and audit requirements where “why” matters as much as “what”.
  • Scalable Tooling: Because the index is pure text, it can be stored in cheap vector databases and refreshed incrementally as code changes, making it viable for CI/CD pipelines (see the sketch after this list).
  • LLM Cost Savings: By staying in the NL domain, the approach reduces the number of expensive LLM calls compared to full‑code RAG (no need to embed megabytes of source).
  • Cross‑Team Collaboration: In microservice ecosystems where ownership is split across many teams, the repository‑routing step automatically directs a bug report to the right owners, cutting down hand‑off friction.
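
A hypothetical CI step for that incremental refresh, reusing `summarize()` and `embedder` from the earlier sketch: only files changed since the last indexed commit are re-summarized and upserted. `store` stands in for any vector database; its `upsert()` interface is illustrative, not a specific product's API.

```python
# Hypothetical CI step: incrementally refresh the summary index on each push.
import subprocess

def changed_files(repo_dir: str, last_indexed_commit: str) -> list[str]:
    """List files touched between the last indexed commit and HEAD."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", f"{last_indexed_commit}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".java")]

def refresh_index(store, repo_dir: str, last_indexed_commit: str) -> None:
    """Re-summarize changed files and upsert them; parent directory and repo
    summaries would also need regeneration, since they aggregate children."""
    for path in changed_files(repo_dir, last_indexed_commit):
        with open(f"{repo_dir}/{path}", errors="ignore") as fh:
            summary = summarize(fh.read(), "source file")
        vec = embedder.encode([summary], normalize_embeddings=True)[0]
        store.upsert(id=path, vector=vec, text=summary)  # illustrative vector-DB call
```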

Limitations & Future Work

  • Summary Quality Dependency: The approach hinges on the LLM’s ability to generate accurate, concise summaries; noisy or outdated summaries can mislead the search.
  • Dynamic Code: Rapidly changing repositories require frequent re‑summarization; the paper notes a trade‑off between freshness and compute cost.
  • Language Coverage: Experiments focused on a Java‑centric codebase; extending to polyglot microservices (e.g., Go, Python, Rust) may need language‑specific prompting.
  • Fine‑Grained Localization: The method stops at the file level; pinpointing the exact line or function remains an open challenge.
  • User Studies: While quantitative metrics are strong, real‑world developer adoption and perceived usefulness have yet to be measured.

Bottom line: By reframing bug localization as a pure natural‑language reasoning task, the authors demonstrate a practical, interpretable, and high‑performing alternative to traditional code search—an approach that could become a cornerstone of AI‑assisted debugging in large‑scale microservice organizations.

Authors

  • Amirkia Rafiei Oskooei
  • S. Selcan Yukcu
  • Mehmet Cevheri Bozoglan
  • Mehmet S. Aktas

Paper Information

  • arXiv ID: 2512.05908v1
  • Categories: cs.SE, cs.AI, cs.CL, cs.IR
  • Published: December 5, 2025