[Paper] Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures

Published: December 5, 2025 at 12:42 PM EST
4 min read
Source: arXiv - 2512.05908v1

Overview

The paper tackles one of the most painful problems in modern microservice environments: finding the exact piece of code that causes a bug when the system spans dozens of repositories. By converting code into layered natural‑language summaries, the authors turn bug localization into a pure “text‑to‑text” search problem that large language models (LLMs) can handle more efficiently than traditional code‑centric techniques.

Key Contributions

  • Hierarchical NL Summaries: Automatic generation of concise natural‑language descriptions for every file, directory, and repository in a microservice codebase.
  • Two‑Phase NL‑to‑NL Search:
    1. Repository routing – quickly narrows the search space to the most relevant repo(s).
    2. Top‑down localization – drills from repository → directory → file using the same NL query.
  • Scalable Evaluation: Tested on DNext, an industrial system with 46 repositories and ~1.1 M LOC, achieving Pass@10 = 0.82 and MRR = 0.50, far surpassing classic IR baselines and agentic RAG tools such as GitHub Copilot and Cursor.
  • Interpretability: The search path (repo → dir → file) is exposed as plain text, giving developers a transparent view of why a particular location was suggested.
  • LLM‑Friendly Design: By staying within the LLM’s token window (pure NL), the approach sidesteps context‑length limits that cripple raw‑code retrieval.

Methodology

  1. Code Summarization
    • A fine‑tuned LLM (e.g., GPT‑4‑Turbo) ingests each source file and produces a short, human‑readable description (e.g., “Handles user authentication via JWT”).
    • Summaries are aggregated upward: directory summaries are synthesized from their files, and repository summaries from their directories.
  2. Index Construction
    • All summaries are stored in a vector store (e.g., FAISS) together with their hierarchical identifiers.
  3. Two‑Phase Retrieval
    • Phase 1 – Repository Routing: The bug report (natural language) is embedded and matched against repository‑level summaries. The top‑k repositories are selected.
    • Phase 2 – Top‑Down Localization: Within each selected repo, the same query is matched against directory summaries, then file summaries, yielding a ranked list of candidate files.
  4. Scoring & Ranking
    • Cosine similarity between query and summary embeddings provides the primary score; a lightweight re‑ranking step incorporates metadata (e.g., recent commit activity). Code sketches of the pipeline follow this list.
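
To make steps 1–2 concrete, here is a minimal sketch of the hierarchical summarization and index construction. Everything below is illustrative, not the authors' implementation: it assumes an OpenAI-compatible chat API, sentence-transformers embeddings, FAISS as the vector store, and (for brevity) a single level of directory aggregation.

```python
# Minimal sketch of steps 1-2: hierarchical summarization + index construction.
from pathlib import Path

import faiss                      # pip install faiss-cpu
import numpy as np
from openai import OpenAI         # pip install openai
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def summarize(text: str, level: str) -> str:
    """Ask the LLM for a one-sentence description of a file/directory/repo."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # the paper mentions a fine-tuned LLM; this model is an assumption
        messages=[{"role": "user",
                   "content": f"Summarize this {level} in one sentence:\n\n{text[:8000]}"}],
    )
    return resp.choices[0].message.content.strip()

def summarize_repo(repo: Path) -> dict[str, str]:
    """Bottom-up aggregation: file summaries -> directory summaries -> repo summary."""
    summaries: dict[str, str] = {}
    files = sorted(repo.rglob("*.java"))               # DNext is Java-centric
    for f in files:
        summaries[str(f)] = summarize(f.read_text(errors="ignore"), "source file")
    for d in sorted({f.parent for f in files}):        # one directory level, for brevity
        children = "\n".join(summaries[str(f)] for f in files if f.parent == d)
        summaries[str(d)] = summarize(children, "directory")
    dir_summaries = "\n".join(s for p, s in summaries.items() if Path(p).is_dir())
    summaries[str(repo)] = summarize(dir_summaries, "repository")
    return summaries

def build_index(summaries: dict[str, str]):
    """Embed summaries (normalized, so inner product = cosine) into FAISS."""
    ids = list(summaries)
    vecs = embedder.encode([summaries[i] for i in ids], normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index, ids
```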

The whole pipeline stays within a single NL‑to‑NL pass, avoiding the need for cross‑modal embeddings (code ↔ text) that are typically noisy and heavyweight.
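
A matching sketch of steps 3–4, the two-phase retrieval plus re-ranking. The phase names come from the paper; the per-level `(index, ids)` pairs, the `recent_commit_score` helper, and the 0.9/0.1 blend weights are assumptions, and `embedder` is reused from the sketch above.

```python
# Sketch of steps 3-4: two-phase NL-to-NL retrieval with lightweight re-ranking.

def recent_commit_score(path: str) -> float:
    """Hypothetical metadata signal: higher for recently changed files."""
    return 0.0  # placeholder; a real system might read `git log` timestamps

def top_k(index, ids, query_vec, k, scope=None):
    """Cosine search over one hierarchy level, optionally restricted to `scope`."""
    scores, idx = index.search(query_vec[None, :].astype("float32"), len(ids))
    hits = [(ids[i], float(s)) for s, i in zip(scores[0], idx[0])
            if scope is None or ids[i].startswith(scope)]
    return hits[:k]

def localize(bug_report, repo_idx, dir_idx, file_idx, k_repos=3, k_files=10):
    """Return a ranked list of (file_path, score) candidates for a bug report."""
    q = embedder.encode([bug_report], normalize_embeddings=True)[0]
    candidates = []
    # Phase 1 - repository routing: match the report against repo summaries.
    for repo, _ in top_k(*repo_idx, q, k_repos):
        # Phase 2 - top-down localization: directory summaries, then files.
        for directory, _ in top_k(*dir_idx, q, 5, scope=repo):  # 5 dirs/repo is arbitrary
            candidates += top_k(*file_idx, q, k_files, scope=directory)
    # Blend cosine similarity with commit-recency metadata for the final order.
    ranked = sorted(candidates,
                    key=lambda c: 0.9 * c[1] + 0.1 * recent_commit_score(c[0]),
                    reverse=True)
    return ranked[:k_files]
```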

Results & Findings

| Metric | Proposed NL‑Summaries | Traditional IR | Copilot‑RAG | Cursor‑RAG |
|---|---|---|---|---|
| Pass@10 | 0.82 | 0.41 | 0.53 | 0.48 |
| MRR (Mean Reciprocal Rank) | 0.50 | 0.22 | 0.31 | 0.28 |

  • Higher recall: The method finds the correct file in the top‑10 results 82% of the time, twice the traditional IR baseline (0.41).
  • Better ranking: An MRR of 0.50 means the reciprocal rank of the first correct file averages 0.5; roughly speaking, the correct file often lands within the top two results.
  • Token efficiency: Summaries average ~30 tokens per file, keeping the total index well within LLM context limits even for large codebases.
  • Interpretability: Developers can read the intermediate directory and repository summaries to understand the reasoning chain, which was not possible with black‑box RAG outputs.
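
For reference, the two headline metrics follow their standard definitions (stated here for convenience; these are textbook formulas, not reproduced from the paper):

```latex
% Pass@k: fraction of bug reports whose faulty file appears in the top-k results.
\mathrm{Pass@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbf{1}\left[\mathrm{rank}_i \le k\right]

% MRR: mean reciprocal rank of the first correct file across all queries.
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

where rank_i is the position of the first correct file for bug report i and |Q| is the number of evaluated reports.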

Practical Implications

  • Faster Debugging: Teams can plug the summarization pipeline into existing issue‑tracking tools (Jira, GitHub Issues) to get immediate, ranked file suggestions.
  • Enterprise AI Trust: The transparent hierarchy (repo → dir → file) satisfies compliance and audit requirements where “why” matters as much as “what”.
  • Scalable Tooling: Because the index is pure text, it can be stored in cheap vector databases and refreshed incrementally as code changes, making it viable for CI/CD pipelines (see the sketch after this list).
  • LLM Cost Savings: By staying in the NL domain, the approach reduces the number of expensive LLM calls compared to full‑code RAG (no need to embed megabytes of source).
  • Cross‑Team Collaboration: In microservice ecosystems where ownership is split across many teams, the repository‑routing step automatically directs a bug report to the right owners, cutting down hand‑off friction.
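
A hypothetical CI step for that incremental refresh, reusing `summarize()` and `embedder` from the earlier sketch: only files changed since the last indexed commit are re-summarized and upserted. `store` stands in for any vector database; its `upsert()` interface is illustrative, not a specific product's API.

```python
# Hypothetical CI step: incrementally refresh the summary index on each push.
import subprocess

def changed_files(repo_dir: str, last_indexed_commit: str) -> list[str]:
    """List files touched between the last indexed commit and HEAD."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", f"{last_indexed_commit}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".java")]

def refresh_index(store, repo_dir: str, last_indexed_commit: str) -> None:
    """Re-summarize changed files and upsert them; parent directory and repo
    summaries would also need regeneration, since they aggregate children."""
    for path in changed_files(repo_dir, last_indexed_commit):
        with open(f"{repo_dir}/{path}", errors="ignore") as fh:
            summary = summarize(fh.read(), "source file")
        vec = embedder.encode([summary], normalize_embeddings=True)[0]
        store.upsert(id=path, vector=vec, text=summary)  # illustrative vector-DB call
```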

Limitations & Future Work

  • Summary Quality Dependency: The approach hinges on the LLM’s ability to generate accurate, concise summaries; noisy or outdated summaries can mislead the search.
  • Dynamic Code: Rapidly changing repositories require frequent re‑summarization; the paper notes a trade‑off between freshness and compute cost.
  • Language Coverage: Experiments focused on a Java‑centric codebase; extending to polyglot microservices (e.g., Go, Python, Rust) may need language‑specific prompting.
  • Fine‑Grained Localization: The method stops at the file level; pinpointing the exact line or function remains an open challenge.
  • User Studies: While quantitative metrics are strong, real‑world developer adoption and perceived usefulness have yet to be measured.

Bottom line: By reframing bug localization as a pure natural‑language reasoning task, the authors demonstrate a practical, interpretable, and high‑performing alternative to traditional code search—an approach that could become a cornerstone of AI‑assisted debugging in large‑scale microservice organizations.

Authors

  • Amirkia Rafiei Oskooei
  • S. Selcan Yukcu
  • Mehmet Cevheri Bozoglan
  • Mehmet S. Aktas

Paper Information

  • arXiv ID: 2512.05908v1
  • Categories: cs.SE, cs.AI, cs.CL, cs.IR
  • Published: December 5, 2025