[Paper] Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures
Source: arXiv - 2512.05908v1
Overview
The paper tackles one of the most painful problems in modern microservice environments: finding the exact piece of code that causes a bug when the system spans dozens of repositories. By converting code into layered natural‑language summaries, the authors turn bug localization into a pure “text‑to‑text” search problem that large language models (LLMs) can handle more efficiently than traditional code‑centric techniques.
Key Contributions
- Hierarchical NL Summaries: Automatic generation of concise natural‑language descriptions for every file, directory, and repository in a microservice codebase.
- Two-Phase NL-to-NL Search:
  - Repository routing – quickly narrows the search space to the most relevant repo(s).
  - Top-down localization – drills from repository → directory → file using the same NL query.
- Scalable Evaluation: Tested on DNext, an industrial system with 46 repositories and ~1.1 M LOC, achieving Pass@10 = 0.82 and MRR = 0.50, far surpassing classic IR baselines and agentic RAG tools such as GitHub Copilot and Cursor.
- Interpretability: The search path (repo → dir → file) is exposed as plain text, giving developers a transparent view of why a particular location was suggested.
- LLM‑Friendly Design: By staying within the LLM’s token window (pure NL), the approach sidesteps context‑length limits that cripple raw‑code retrieval.
Methodology
- Code Summarization (a minimal sketch follows this list)
  - A fine-tuned LLM (e.g., GPT-4-Turbo) ingests each source file and produces a short, human-readable description (e.g., “Handles user authentication via JWT”).
  - Summaries are aggregated upward: directory summaries are synthesized from their files, and repository summaries from their directories.
- Index Construction
  - All summaries are stored in a vector store (e.g., FAISS) together with their hierarchical identifiers.
- Two-Phase Retrieval (see the sketch after the closing paragraph below)
  - Phase 1 – Repository Routing: the bug report (natural language) is embedded and matched against repository-level summaries; the top-k repositories are selected.
  - Phase 2 – Top-Down Localization: within each selected repo, the same query is matched against directory summaries, then file summaries, yielding a ranked list of candidate files.
- Scoring & Ranking
  - Cosine similarity between query and summary embeddings provides the primary score; a lightweight re-ranking step incorporates metadata (e.g., recent commit activity).
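To make the aggregation step concrete, here is a minimal sketch of the bottom-up pass. It is illustrative only: `llm_summarize` is a placeholder for the authors' fine-tuned summarization model, and the prompts are invented.

```python
from pathlib import Path

def llm_summarize(prompt: str) -> str:
    # Placeholder for the paper's fine-tuned summarizer; replace with a real
    # chat-completion call. Truncating keeps the sketch runnable offline.
    return prompt.splitlines()[0][:120] if prompt else ""

def summarize_repo(repo_root: Path) -> dict[str, str]:
    """Build file -> directory -> repository summaries bottom-up."""
    summaries: dict[str, str] = {}
    files = list(repo_root.rglob("*.java"))  # Java-centric, as in the paper

    # 1. File level: one short NL description per source file.
    for f in files:
        summaries[str(f)] = llm_summarize(
            "In one sentence, what does this file do?\n" + f.read_text(errors="ignore")
        )

    # 2. Directory level, deepest first, so each directory summary can fold
    #    in the already-computed summaries of its files and subdirectories.
    for d in sorted({f.parent for f in files},
                    key=lambda p: len(p.parts), reverse=True):
        children = "\n".join(s for p, s in summaries.items() if Path(p).parent == d)
        summaries[str(d)] = llm_summarize(
            "Summarize the purpose of a directory whose contents do:\n" + children
        )

    # 3. Repository level: synthesized from the directory summaries.
    dir_text = "\n".join(s for p, s in summaries.items() if Path(p).is_dir())
    summaries[str(repo_root)] = llm_summarize(
        "Summarize this repository, given what its directories do:\n" + dir_text
    )
    return summaries
```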
The whole pipeline stays within a single NL‑to‑NL pass, avoiding the need for cross‑modal embeddings (code ↔ text) that are typically noisy and heavyweight.
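The retrieval pass itself compresses to a few lines. Again a sketch under stated assumptions: the random `embed` stub stands in for a real encoder backed by a vector store such as FAISS, summary vectors would be pre-computed rather than embedded per query, and the fan-out constants (top-3 repos and directories, top-5 files) are invented rather than taken from the paper.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder encoder: random but L2-normalized rows, so dot product
    # equals cosine similarity. Swap in a real embedding model for use.
    rng = np.random.default_rng(0)
    v = rng.standard_normal((len(texts), 384))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def top_k(query: np.ndarray, names: list[str], vecs: np.ndarray, k: int):
    scores = vecs @ query                      # cosine similarity per summary
    return [(names[i], float(scores[i])) for i in np.argsort(-scores)[:k]]

def localize(bug_report: str,
             repo_summaries: dict[str, str],            # repo -> summary
             dir_summaries: dict[str, dict[str, str]],  # repo -> {dir: summary}
             file_summaries: dict[str, dict[str, str]], # repo -> {file: summary}
             k_repos: int = 3) -> list[tuple[str, float]]:
    q = embed([bug_report])[0]

    # Phase 1 - repository routing: match the report against repo summaries.
    repos = list(repo_summaries)
    routed = top_k(q, repos, embed([repo_summaries[r] for r in repos]), k_repos)

    # Phase 2 - top-down localization: drill repo -> directory -> file.
    candidates: list[tuple[str, float]] = []
    for repo, _ in routed:
        dirs = dir_summaries[repo]
        for d, _ in top_k(q, list(dirs), embed(list(dirs.values())), k=3):
            files = {f: s for f, s in file_summaries[repo].items()
                     if f.startswith(d)}
            if files:
                candidates += top_k(q, list(files), embed(list(files.values())), k=5)

    # A metadata-aware re-rank (e.g., recent commit activity) would slot in here.
    return sorted(candidates, key=lambda c: -c[1])[:10]
```

The nested dict layout mirrors the hierarchical identifiers the paper stores alongside each summary.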
Results & Findings
| Metric | Proposed NL‑Summaries | Traditional IR | Copilot‑RAG | Cursor‑RAG |
|---|---|---|---|---|
| Pass@10 | 0.82 | 0.41 | 0.53 | 0.48 |
| MRR (Mean Reciprocal Rank) | 0.50 | 0.22 | 0.31 | 0.28 |
- Higher recall: The method finds the correct file in the top-10 results 82% of the time, a 2× improvement over plain code search.
- Better ranking: An MRR of 0.50 indicates that the correct file is, on average, near the top of the list (see the metric sketch after this list).
- Token efficiency: Summaries average ~30 tokens per file, keeping the total index well within LLM context limits even for large codebases.
- Interpretability: Developers can read the intermediate directory and repository summaries to understand the reasoning chain, which was not possible with black‑box RAG outputs.
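For readers less familiar with the metrics, Pass@k and MRR fall directly out of the rank at which each bug's ground-truth file appears. The definitions below are standard; the example numbers are invented.

```python
def pass_at_k(ranks: list[int | None], k: int = 10) -> float:
    # Fraction of bugs whose correct file appears in the top k results;
    # None means the file was never retrieved.
    return sum(r is not None and r <= k for r in ranks) / len(ranks)

def mrr(ranks: list[int | None]) -> float:
    # Mean Reciprocal Rank: average of 1/rank, counting misses as 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Three bugs localized at ranks 1, 4, and never retrieved:
print(pass_at_k([1, 4, None]))  # 0.666...
print(mrr([1, 4, None]))        # (1 + 0.25 + 0) / 3 ≈ 0.417
```

On that scale, the reported MRR of 0.50 corresponds to the correct file landing around second place on average.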
Practical Implications
- Faster Debugging: Teams can plug the summarization pipeline into existing issue‑tracking tools (Jira, GitHub Issues) to get immediate, ranked file suggestions.
- Enterprise AI Trust: The transparent hierarchy (repo → dir → file) satisfies compliance and audit requirements where “why” matters as much as “what”.
- Scalable Tooling: Because the index is pure text, it can be stored in cheap vector databases and refreshed incrementally as code changes (a hypothetical refresh sketch follows this list), making it viable for CI/CD pipelines.
- LLM Cost Savings: By staying in the NL domain, the approach reduces the number of expensive LLM calls compared to full‑code RAG (no need to embed megabytes of source).
- Cross‑Team Collaboration: In microservice ecosystems where ownership is split across many teams, the repository‑routing step automatically directs a bug report to the right owners, cutting down hand‑off friction.
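On the incremental-refresh point above: with summaries keyed by file path, a CI job only needs to re-summarize what a commit touched, then regenerate the affected directory and repository summaries. The git call below is standard; the update loop is hypothetical glue around the file-level step of the earlier sketch.

```python
import subprocess

def changed_files(base: str = "HEAD~1", head: str = "HEAD") -> list[str]:
    # Paths touched between two commits, via `git diff --name-only`.
    out = subprocess.run(["git", "diff", "--name-only", base, head],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

# On each CI run: re-summarize and re-embed only the touched files, then
# regenerate the summaries of their parent directories and repository.
for path in changed_files():
    print("stale summary:", path)  # replace with the file-level step above
```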
Limitations & Future Work
- Summary Quality Dependency: The approach hinges on the LLM’s ability to generate accurate, concise summaries; noisy or outdated summaries can mislead the search.
- Dynamic Code: Rapidly changing repositories require frequent re‑summarization; the paper notes a trade‑off between freshness and compute cost.
- Language Coverage: Experiments focused on a Java‑centric codebase; extending to polyglot microservices (e.g., Go, Python, Rust) may need language‑specific prompting.
- Fine‑Grained Localization: The method stops at the file level; pinpointing the exact line or function remains an open challenge.
- User Studies: While quantitative metrics are strong, real‑world developer adoption and perceived usefulness have yet to be measured.
Bottom line: By reframing bug localization as a pure natural‑language reasoning task, the authors demonstrate a practical, interpretable, and high‑performing alternative to traditional code search—an approach that could become a cornerstone of AI‑assisted debugging in large‑scale microservice organizations.
Authors
- Amirkia Rafiei Oskooei
- S. Selcan Yukcu
- Mehmet Cevheri Bozoglan
- Mehmet S. Aktas
Paper Information
- arXiv ID: 2512.05908v1
- Categories: cs.SE, cs.AI, cs.CL, cs.IR
- Published: December 5, 2025