[Paper] Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs

Published: January 13, 2026 at 01:03 PM EST
4 min read
Source: arXiv - 2601.08773v1

Overview

This paper investigates how to make Retrieval‑Augmented Generation (RAG) more reliable for software‑engineering tasks such as navigating complex Java codebases. By comparing three retrieval pipelines—plain vector search, an LLM‑generated knowledge‑graph, and a deterministic AST‑derived graph—the author shows that a graph built directly from the abstract syntax tree (AST) gives the best coverage, accuracy, and cost‑efficiency for multi‑hop architectural queries.

Key Contributions

  • Three‑pipeline benchmark on real‑world Java projects (Shopizer, ThingsBoard, OpenMRS) covering indexing time, query latency, coverage, cost, and answer correctness.
  • Deterministic AST‑derived knowledge graph (DKB) construction using Tree‑sitter and bidirectional traversal, built in seconds with near‑complete file coverage.
  • Empirical evidence that LLM‑extracted knowledge graphs (LLM‑KB) miss a substantial portion of the codebase and incur higher indexing costs.
  • Demonstration that plain vector‑only RAG struggles with multi‑hop architectural reasoning and suffers from higher hallucination rates.
  • Practical guidelines for choosing a retrieval strategy when building code‑assistant tools or automated documentation systems.

Methodology

  1. Repositories & Queries – The study uses three open‑source Java codebases (Shopizer, ThingsBoard, OpenMRS). For each, a curated set of 15 “architecture and code‑tracing” questions (e.g., “Which service does OrderController call?”) is executed.
  2. Retrieval Pipelines
    • No‑Graph RAG: Traditional vector similarity search on chunked source files.
    • LLM‑KB: Prompt an LLM to extract entities (classes, methods, interfaces) and relationships, then embed the resulting knowledge‑graph nodes.
    • DKB: Parse the code with Tree‑sitter, walk the AST to extract a deterministic graph of declarations, inheritance, method calls, and dependency edges; embed each node.
  3. Metrics – For each pipeline the author records:
    • Indexing time (graph construction + embedding)
    • Query latency (retrieval + generation)
    • Corpus coverage (percentage of source files represented in the index)
    • Cost (LLM API calls, compute time)
    • Answer correctness (human‑rated on a 0‑2 scale).
  4. Evaluation – Answers are compared against ground‑truth derived from the code and documentation. Hallucinations (incorrect but plausible statements) are logged separately.
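To make step 2's DKB construction concrete, here is a minimal sketch of the idea: walk an AST deterministically and emit declaration nodes plus call edges. The paper parses Java with Tree-sitter; this illustration substitutes Python's stdlib `ast` module as a language-agnostic stand-in, and the `OrderController`/`OrderService` snippet is an invented toy example, not code from the benchmarked repositories.

```python
import ast
from collections import defaultdict

# Toy source standing in for a Java codebase (names are hypothetical).
SOURCE = """
class OrderController:
    def place_order(self):
        return OrderService().save()

class OrderService:
    def save(self):
        return True
"""

def build_graph(source: str):
    """Deterministically extract declaration nodes and call edges from an AST."""
    tree = ast.parse(source)
    nodes, edges = [], defaultdict(set)
    scope = [None]  # stack of enclosing declaration names

    class Visitor(ast.NodeVisitor):
        def visit_ClassDef(self, node):
            nodes.append(("class", node.name))
            scope.append(node.name)
            self.generic_visit(node)
            scope.pop()

        def visit_FunctionDef(self, node):
            qual = f"{scope[-1]}.{node.name}" if scope[-1] else node.name
            nodes.append(("method", qual))
            scope.append(qual)
            self.generic_visit(node)
            scope.pop()

        def visit_Call(self, node):
            # Record an edge from the enclosing declaration to the callee name.
            if scope[-1] is not None:
                if isinstance(node.func, ast.Name):
                    edges[scope[-1]].add(node.func.id)
                elif isinstance(node.func, ast.Attribute):
                    edges[scope[-1]].add(node.func.attr)
            self.generic_visit(node)

    Visitor().visit(tree)
    return nodes, dict(edges)

nodes, edges = build_graph(SOURCE)
print(nodes)
print(edges)
```

Because the walk is a pure function of the parse tree, the same source always yields the same graph — which is what gives the DKB its 100 % coverage and sub-second-scale build times, in contrast to prompting an LLM per file.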

Results & Findings

Metric                 | No‑Graph RAG                       | LLM‑KB                                | DKB
Indexing time          | ~5 min (vector embedding)          | 30–45 min (LLM prompting + embedding) | < 10 s (AST parsing + embedding)
Coverage               | 92 % of files (some chunks missed) | 84 % (377 files skipped in Shopizer)  | 100 % (deterministic parsing)
Query latency          | 1.2 s avg                          | 2.8 s avg, high variance              | 1.3 s avg
Cost (per repo)        | Low (vector embeddings only)       | High (LLM calls dominate)             | Modest (AST + cheap embeddings)
Correctness (Shopizer) | 0.78 avg score (worst on multi‑hop)| 1.45 avg score                        | 1.62 avg score (best)
Hallucination rate     | 27 % of answers                    | 12 %                                  | 8 %

Takeaway: The deterministic AST‑derived graph consistently outperforms the LLM‑generated graph in coverage and answer quality while being far cheaper and faster to build. Plain vector search lags behind on any query that requires reasoning across multiple code entities.
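The multi-hop advantage is easy to see once the graph exists: an architectural question like "which service does OrderController call, and what does that service reach?" reduces to a breadth-first search over call edges rather than a single-shot embedding lookup. A minimal sketch, with an invented call graph (these class names are hypothetical, not from the paper's benchmarks):

```python
from collections import deque

# Hypothetical deterministic call graph: controller -> service -> repository -> DB layer.
CALLS = {
    "OrderController": ["OrderService"],
    "OrderService": ["OrderRepository"],
    "OrderRepository": ["JdbcTemplate"],
}

def call_chain(start: str, target: str):
    """Return the shortest call path from `start` to `target`, or None."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for callee in CALLS.get(path[-1], []):
            queue.append(path + [callee])
    return None

print(call_chain("OrderController", "JdbcTemplate"))
# → ['OrderController', 'OrderService', 'OrderRepository', 'JdbcTemplate']
```

A vector-only retriever must hope that all three hops co-occur in retrieved chunks; the graph traversal recovers the full chain by construction, which is consistent with the correctness gap the table above reports.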

Practical Implications

  • Tooling for developers – IDE plugins or chat‑based code assistants can adopt AST‑derived graphs to provide accurate, multi‑step navigation (e.g., “Show the full call chain from a controller to the DB layer”).
  • Cost‑effective scaling – Since DKB indexing is seconds‑fast and cheap, large monorepos can be refreshed frequently without prohibitive cloud expenses.
  • Reduced hallucinations – Deterministic graphs give the LLM a concrete grounding, lowering the chance of fabricating non‑existent relationships—a critical safety factor for production‑grade code‑assistants.
  • Hybrid pipelines – Teams could keep a lightweight vector store for topical search and overlay the AST graph for structural queries, achieving the best of both worlds.
  • Compliance & Auditing – A deterministic graph provides a verifiable artifact (the AST) that can be audited, unlike opaque LLM‑generated knowledge bases.
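The hybrid-pipeline bullet above can be sketched in a few lines: use the vector store to find topical entry points, then expand along deterministic graph edges so structurally related code is retrieved too. The embeddings, edges, and class names below are toy placeholders under that assumption, not the paper's implementation:

```python
import math

# Toy node embeddings (2-D for illustration) and deterministic call edges.
EMBEDDINGS = {
    "OrderController": [0.9, 0.1],
    "InvoiceService": [0.2, 0.8],
    "OrderService": [0.7, 0.3],
}
EDGES = {"OrderController": ["OrderService"], "OrderService": ["OrderRepository"]}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def hybrid_retrieve(query_vec, k=1, hops=1):
    """Vector phase picks top-k topical nodes; graph phase adds structural neighbors."""
    ranked = sorted(EMBEDDINGS, key=lambda n: cosine(EMBEDDINGS[n], query_vec), reverse=True)
    selected = set(ranked[:k])
    frontier = set(selected)
    for _ in range(hops):
        frontier = {m for n in frontier for m in EDGES.get(n, [])}
        selected |= frontier
    return selected

print(hybrid_retrieve([1.0, 0.0]))
```

Here a query vector near `OrderController` pulls in `OrderService` via the graph hop even though the query text never mentioned it — the structural overlay compensating for what similarity search alone would miss.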

Limitations & Future Work

  • Language scope – Experiments are limited to Java; extending to dynamically typed languages (Python, JavaScript) may require additional heuristics.
  • Graph richness – The current AST‑derived graph captures declarations and call edges but omits runtime information (e.g., reflection, DI container wiring) that could affect accuracy.
  • LLM prompt engineering – The study uses a single prompting strategy for LLM‑KB; more sophisticated prompts or fine‑tuned models might close the coverage gap.
  • User‑study validation – Human‑centric evaluations (developer satisfaction, time‑to‑solution) are left for future work.
  • Integration with CI/CD – Automating DKB updates in continuous integration pipelines and measuring impact on code review assistance remains an open avenue.

Authors

  • Manideep Reddy Chinthareddy

Paper Information

  • arXiv ID: 2601.08773v1
  • Categories: cs.SE, cs.AI
  • Published: January 13, 2026