How GraphRAG Works
Source: Dev.to
Indexing Phase (Offline, Expensive but Done Once)
- Text Chunking – Split the input text into manageable chunks.
- Entity Extraction – Use an LLM to identify entities (people, places, organizations, concepts) and relationships from each chunk.
- Build Knowledge Graph – Create a graph where nodes are entities and edges are relationships (with descriptions).
- Community Detection – Apply graph algorithms (e.g., Leiden algorithm) to identify clusters of closely related entities (communities).
- Hierarchical Summarization – Generate summaries for each community at multiple levels (bottom‑up hierarchy: detailed low‑level communities → higher‑level aggregated summaries).
The result is a structured index: the graph plus pre‑generated community summaries. This captures implicit connections across the entire dataset that vector embeddings alone miss.
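The indexing steps above can be sketched in miniature. This is an illustrative toy, not the microsoft/graphrag implementation: `extract_triples` stands in for an LLM extraction call, and connected components stand in for the Leiden algorithm (which finds densely connected clusters, not merely connected ones).

```python
from collections import defaultdict

def chunk_text(text: str, size: int = 50) -> list[str]:
    """Step 1: split the input into fixed-size word chunks
    (real systems chunk by token count, with overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def extract_triples(chunk: str) -> list[tuple[str, str, str]]:
    """Step 2 (stub): an LLM would return (entity, relation, entity)
    triples with descriptions; here we fabricate one per chunk."""
    words = chunk.split()
    return [(words[0], "related_to", words[-1])] if len(words) > 1 else []

def build_index(text: str) -> dict[int, set]:
    # Step 3: build the knowledge graph as an adjacency list;
    # nodes are entities, edges are extracted relationships.
    graph: defaultdict[str, set] = defaultdict(set)
    for chunk in chunk_text(text):
        for head, _relation, tail in extract_triples(chunk):
            graph[head].add(tail)
            graph[tail].add(head)
    # Step 4: connected components as a crude stand-in for
    # Leiden community detection.
    seen: set = set()
    communities = []
    for node in list(graph):
        if node in seen:
            continue
        stack, members = [node], set()
        while stack:
            n = stack.pop()
            if n not in members:
                members.add(n)
                stack.extend(graph[n] - members)
        seen |= members
        communities.append(members)
    # Step 5 (stub): an LLM would summarize each community bottom-up,
    # producing the hierarchical summaries used at query time.
    return {i: members for i, members in enumerate(communities)}
```

The returned mapping of community id to member entities is the skeleton of the structured index; the real system also stores edge descriptions and the generated community summaries.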
Querying Phase
- Local Queries (specific details) – Retrieve the subgraph and source text chunks surrounding the entities mentioned in the query.
- Global Queries (broad understanding) –
  - Select relevant community summaries (based on similarity to the query).
  - Use the LLM to generate partial answers from each summary.
  - Aggregate and summarize the partial answers into a final coherent response.
This “map‑reduce” style over communities enables holistic reasoning.
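The map-reduce pattern for global queries can be sketched as follows. This is a hedged illustration, not the actual microsoft/graphrag query engine: `answer_from` stands in for an LLM call, and relevance scoring is naive word overlap rather than embedding similarity.

```python
def score(summary: str, query: str) -> float:
    """Toy relevance: fraction of query words appearing in the summary
    (a real system would use embedding similarity)."""
    q, s = set(query.lower().split()), set(summary.lower().split())
    return len(q & s) / (len(q) or 1)

def answer_from(summary: str, query: str) -> str:
    # Stub: an LLM would produce a partial answer grounded in the summary.
    return f"Partial answer based on: {summary}"

def global_query(query: str, summaries: list[str], top_k: int = 3) -> str:
    # Map: select the most relevant community summaries and
    # generate a partial answer from each.
    ranked = sorted(summaries, key=lambda s: score(s, query), reverse=True)
    partials = [answer_from(s, query) for s in ranked[:top_k]]
    # Reduce: a final LLM call would merge the partials into one
    # coherent response; here we simply concatenate them.
    return "\n".join(partials)
```

Because each partial answer is produced independently, the map step parallelizes across communities, which is what keeps global queries tractable over large corpora.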
Why It’s Better Than Standard RAG
- Comprehensiveness – Captures broader themes and connections, leading to more complete answers.
- Diversity – Reduces repetition and surfaces varied perspectives.
- Empowerment – Provides grounded, evidence‑based insights for complex datasets (e.g., conflicting news sources).
Experiments in the original paper (datasets ≈ 1 million tokens) report win rates of roughly 70–80% for GraphRAG over baseline vector RAG on metrics such as comprehensiveness and diversity for global questions.
Practical Details
- Open‑source implementation: microsoft/graphrag on GitHub
- Costs – Indexing is LLM‑intensive (many calls for extraction and summarization), but querying is efficient.
- Later improvements – Variants such as LazyGraphRAG (more cost‑efficient), DRIFT search, dynamic community selection, and auto‑tuning for new domains.
Summary
GraphRAG represents a major advancement in enabling LLMs to reason over large, private, narrative‑rich datasets by leveraging graph structures for “global sensemaking.” It is especially valuable when standard RAG yields incomplete or superficial answers.