Understanding Context and Contextual Retrieval in RAG
Source: Towards Data Science
Hybrid Search + RAG: Why It Matters
In my recent post, RAG with Hybrid Search – How Does Keyword Search Work?, I explained how adding a keyword‑search component (e.g., BM25) to a Retrieval‑Augmented Generation (RAG) pipeline can dramatically boost its effectiveness.
- Semantic search alone – matches on embedding similarity, retrieving chunks whose meaning is close to the query even when the wording differs.
- The problem – In large knowledge bases, pure semantic search can miss exact phrase matches that do exist in the source material.
- Hybrid search – Combines semantic embeddings with keyword matching, delivering more comprehensive results and a noticeable performance lift for RAG systems.

The Context‑Loss Issue in Chunk‑Based Retrieval
Even with hybrid search, RAG can still overlook crucial information that is scattered across a document. This happens because:
- Chunking removes surrounding text – When a document is split into smaller pieces, the context that gives each chunk meaning can be lost.
- Complex, inter‑connected content suffers – References to tables, figures, or concepts that span multiple sections (e.g., “as shown in the Table, profits increased by 6%”) become ambiguous once the surrounding context is stripped away.
- Resulting retrieval errors – Irrelevant chunks may be fetched, leading to off‑topic or incorrect model responses.
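To make the chunking problem concrete, here is a minimal sketch (a toy sentence-level chunker and a hypothetical two-sentence document) of a cross-reference losing its antecedent:

```python
def chunk_text(text: str) -> list[str]:
    """Toy chunker: one chunk per sentence (real pipelines split by token count)."""
    parts = text.split(". ")
    return [p if p.endswith(".") else p + "." for p in parts]

document = (
    "Table 3 shows quarterly profits for the EMEA region. "
    "As shown in the Table, profits increased by 6% after the pricing change."
)

chunks = chunk_text(document)
# The second chunk mentions "the Table" but no longer says which table or
# which region, so on its own it is ambiguous to both the embedder and BM25.
print(chunks[1])
```

A query such as "EMEA quarterly profits" can miss the second chunk entirely, because the words that would match it live only in the first chunk.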
Common (but insufficient) Remedies
| Approach | What it does | Drawbacks |
|---|---|---|
| Increase chunk size | Keeps more surrounding text in each piece. | Dilutes semantic focus; retrieval becomes less precise. |
| Increase chunk overlap | Repeats portions of text across adjacent chunks. | Higher storage & compute costs; still cannot capture cross‑boundary relationships. |
| Hypothetical Document Embeddings (HyDE) | Generates a synthetic “ideal” document for each query, then embeds it. | Improves recall modestly but does not fully solve context loss. |
| Document Summary Index | Stores concise summaries of each document for faster lookup. | Summaries may omit nuanced connections needed for accurate answers. |
While these techniques help, they don’t fully address the root cause: loss of the broader narrative that ties individual chunks together.
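As a quick illustration of the overlap remedy (a sketch with made-up window sizes): a sliding window duplicates text at chunk boundaries, but it still cannot connect a chunk to a table or definition several chunks away.

```python
def chunk_with_overlap(tokens: list[str], size: int = 6, overlap: int = 2) -> list[list[str]]:
    """Sliding-window chunking: each chunk repeats the last `overlap` tokens
    of the previous one. This helps at boundaries, but cross-document
    references remain unresolved."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_with_overlap(tokens):
    print(chunk)
```

Note the storage cost: the overlapping tokens are embedded and indexed twice.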
The Real Solution: Contextual Retrieval
Contextual retrieval – introduced by Anthropic in 2024 – preserves the surrounding context of each chunk during the retrieval step, dramatically improving RAG accuracy.
Read the original announcement: Anthropic – Introducing Contextual Retrieval (2024)
How it works
- Context generation at ingestion – For every chunk, an LLM is shown the full source document and asked to write a short text that situates the chunk within it.
- Context + chunk pairing – The generated context is prepended to the chunk, and the combined text is what gets embedded and indexed.
- No query‑time changes – Because the context is baked in during ingestion, retrieval itself runs exactly as before, just over richer chunks.
Benefits
- Higher relevance – Each chunk carries the background needed to disambiguate it, reducing hallucinations and off‑topic answers.
- No added query latency – All the extra work happens once, at ingestion, not on every user request.
- Compatibility with hybrid search – The contextualized text improves both semantic embeddings and keyword‑based BM25 scores.
TL;DR
- Hybrid search (semantic + keyword) already improves RAG, but chunking still discards essential context.
- Traditional fixes (larger chunks, overlap, HyDE, summary indexes) only provide marginal gains.
- Contextual retrieval enriches each chunk with its surrounding context at ingestion time, giving the LLM the full picture it needs for accurate, grounded responses.
Implementing contextual retrieval is currently the most effective way to overcome the context‑loss problem and unlock the full potential of RAG pipelines.
What About Context?
Before diving into contextual retrieval, let’s step back and clarify what context actually means for large language models (LLMs).
We’ve all heard about “context windows,” but what does that term refer to?
Definition
Context = all the tokens that are available to the LLM when it predicts the next token.
In practice, this includes:
- The user prompt
- The system prompt
- Instructions, skills, or any other guidelines that influence the model’s output
- The portion of the model’s own response that has already been generated (each new token is generated based on everything that came before it)
Because LLMs generate text one token at a time, the entire history of the conversation is part of the context.
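This token-by-token loop can be sketched with a toy stand-in for the model (a hard-coded lookup table, not a real LLM): every generated token is appended to the context before the next one is predicted.

```python
def toy_next_token(context: list[str]) -> str:
    """Stand-in for an LLM's next-token prediction (a lookup, not a model)."""
    table = {
        ("ordered", "a"): "pizza",
        ("a", "pizza"): "with",
        ("pizza", "with"): "mushrooms",
    }
    return table.get(tuple(context[-2:]), "<eos>")

# The context starts as the prompt (system prompt, instructions, etc. would
# also live here)...
context = "I went to a restaurant and ordered a".split()

# ...and every generated token becomes part of the context for the next step.
while (token := toy_next_token(context)) != "<eos>":
    context.append(token)

print(" ".join(context))
```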
Why Context Matters
Small changes in context can lead to very different completions:
| Prompt fragment | Likely continuation |
|---|---|
| “I went to a restaurant and ordered a” | pizza. |
| “I went to the pharmacy and bought a” | painkiller. |
The Context‑Window Limitation
The context window is the maximum number of tokens an LLM can consider in a single request.
- Early models: just a few thousand tokens (2 k–8 k)
- Modern frontier models: hundreds of thousands of tokens
For example, Claude Opus 4.6 (200 k‑token window) can ingest roughly 500–600 pages of text in one go. If all the information you need fits within that limit, you can simply feed it to the model and expect a strong answer.
When the Knowledge Base Is Larger Than the Window
Most real‑world use cases involve knowledge bases far larger than any context window (e.g., legal libraries, equipment manuals). Because we can’t pass everything to the model, we must select the most relevant pieces of information to include. This selection process is the core of Retrieval‑Augmented Generation (RAG) and is often referred to as context engineering:
Identifying the optimal subset of information to place inside a limited context window so the model can produce the best possible response.
— LangChain Context Engineering docs
The Retrieval Step in RAG
The most crucial part of a RAG system is ensuring that the right information is retrieved and fed to the LLM. Retrieval can be performed via:
- Semantic search – finds chunks with similar meaning
- Keyword search – finds exact‑match terms
Even after combining both methods, some important information may still be omitted.
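One common way to merge the two ranked result lists is reciprocal rank fusion (RRF); the sketch below uses made-up chunk IDs, and the constant 60 is the conventional RRF default rather than anything specific to RAG:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document earns 1 / (k + rank) from every
    list it appears in, then documents are sorted by total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_7", "chunk_2", "chunk_9"]   # ranked by embedding similarity
keyword  = ["chunk_2", "chunk_5", "chunk_7"]   # ranked by BM25 score
print(reciprocal_rank_fusion([semantic, keyword]))
```

Chunks that appear high in both lists float to the top; a chunk found by only one method still survives, just with a lower fused score.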
What Kind of Information Might Be Missed?
Consider two documents from different domains that contain identical phrasing:
“Heat the mixture slowly.”
- In a recipe book, this refers to cooking.
- In a chemical‑processing manual, it refers to an industrial procedure.
The semantic meaning (the act of heating) is the same, but the context (cooking vs. chemical engineering) is different. Preserving that surrounding context is essential for accurate retrieval.
Contextual Retrieval
Contextual retrieval aims to keep the surrounding meaning of each text chunk, ensuring that the model receives not just the relevant sentence but also the domain‑specific context that disambiguates it.

What About Contextual Retrieval?
Contextual retrieval is a methodology used in Retrieval‑Augmented Generation (RAG) to preserve the context of each chunk. When a chunk is retrieved and passed to the LLM, we want to keep as much of its original meaning as possible—the semantics, the keywords, and the surrounding context.
How It Works
- Generate a helper text for each chunk – the contextual text – that situates the chunk within its source document.
- Prompt an LLM to produce this contextual text by providing both the full document and the specific chunk.
- Combine the returned contextual text with the original chunk; the pair is then treated as an inseparable unit for indexing and retrieval.
Prompt Example
Below is a prompt that can be sent to an LLM to obtain the contextual text for a chunk from an Italian Cookbook:
[the entire Italian Cookbook document the chunk comes from]
Here is the chunk we want to place within the context of the full document.
[the actual chunk]
Provide a brief context that situates this chunk within the overall document to improve search retrieval. Respond only with the concise context and nothing else.
Expected LLM Response
The LLM returns a short contextual description, which we prepend to the chunk:
Context: Recipe step for simmering homemade tomato pasta sauce.
Chunk: Heat the mixture slowly and stir occasionally to prevent it from sticking.
Now the retrieval system knows exactly what “the mixture” refers to, eliminating ambiguity between tomato sauce and, say, a laboratory starch solution.
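The steps above can be sketched end to end. Here `generate_context` is a stub that returns the canned answer from the example so the sketch stays self-contained; a real pipeline would send the formatted prompt to an LLM:

```python
CONTEXT_PROMPT = """\
{document}

Here is the chunk we want to place within the context of the full document.

{chunk}

Provide a brief context that situates this chunk within the overall document
to improve search retrieval. Respond only with the concise context and nothing else."""

def generate_context(document: str, chunk: str) -> str:
    """Stub for an LLM call; a real pipeline would send the formatted
    CONTEXT_PROMPT to a model and return its reply."""
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    assert chunk in prompt  # the chunk is embedded verbatim in the prompt
    return "Recipe step for simmering homemade tomato pasta sauce."

def contextualize(document: str, chunk: str) -> str:
    """Prepend the generated context; the pair is indexed as one unit."""
    context = generate_context(document, chunk)
    return f"Context: {context}\nChunk: {chunk}"

chunk = "Heat the mixture slowly and stir occasionally to prevent it from sticking."
print(contextualize("[full Italian Cookbook text]", chunk))
```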
Indexing
From this point onward, the context + chunk pair is treated as a single unit:
- Embeddings are generated for the combined text and stored in a vector store.
- BM25 (or another lexical index) is built on the same combined text.
Both the dense and sparse indexes therefore contain the contextual information, which dramatically improves relevance.
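A sketch of querying over the combined text: the "embedding" here is a toy bag-of-words vector standing in for a real embedding model, and a BM25 index would be built over the same combined strings.

```python
from collections import Counter
import math

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words term count."""
    return Counter(w.strip(".,:").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Context + chunk indexed as a single unit (the article's running example).
units = [
    "Context: Recipe step for simmering homemade tomato pasta sauce. "
    "Chunk: Heat the mixture slowly and stir occasionally.",
    "Context: Procedure from a chemical-processing manual for a starch solution. "
    "Chunk: Heat the mixture slowly and stir occasionally.",
]
vectors = [toy_embed(u) for u in units]  # the "dense" index

query = "how long to simmer tomato sauce"
q = toy_embed(query)
best = max(range(len(units)), key=lambda i: cosine(q, vectors[i]))
print(units[best])
```

Without the prepended context the two chunks would be character-for-character identical, and no retriever, dense or sparse, could tell the cooking step from the laboratory one.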
Impact
According to Anthropic, contextual embeddings reduce the top‑20 retrieval failure rate by about 35 %, and by 49 % when combined with contextual BM25.
Reducing Cost with Prompt Caching
I hear you asking, “But isn’t this going to cost a fortune?” Surprisingly, no.
Intuitively, we expect this setup to significantly increase the cost of ingestion for a RAG pipeline, essentially doubling it if not more. After all, we have added a bunch of extra LLM calls, haven’t we?
This is true to some extent: for each chunk we now make an additional call to the LLM to situate it within its source document and obtain the contextual text.
Why the Extra Cost Is Only Ingest‑Time
- One‑time expense – The additional LLM calls occur only during document ingestion.
- No runtime overhead – Unlike query‑time techniques such as Hypothetical Document Embeddings (HyDE), which add an LLM call to every query, contextual retrieval does the heavy lifting up front.
- Scalability – Runtime approaches require extra LLM calls for every user query, quickly inflating latency and operational costs.
By shifting the computation to the ingestion phase, we obtain higher retrieval quality without any additional overhead during runtime.
Further Cost Reductions
- Prompt caching – Cache the full source document in the prompt; every subsequent chunk‑situating call for that document then reuses the cached tokens at a fraction of the normal input cost.
- Batch processing – Group chunks when calling the LLM to take advantage of token‑level discounts some providers offer.
- Selective caching – Cache only the most frequently accessed documents or those with high retrieval importance.
In short, while contextual retrieval adds some upfront cost, strategic use of prompt caching and other optimizations keeps the overall expense low and eliminates extra runtime charges.
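As a sketch of how prompt caching fits a Messages-style API request (request shape only, no network call is made; the model name is a placeholder): the large document block is marked cacheable, so repeated chunk-situating calls for the same document reuse it.

```python
def build_context_request(document: str, chunk: str) -> dict:
    """Build a Messages-API-style request body. The document sits in the
    system block with cache_control, so subsequent calls for other chunks
    of the same document can hit the cache instead of re-billing those tokens."""
    return {
        "model": "claude-model-placeholder",  # placeholder model name
        "max_tokens": 200,
        "system": [
            {
                "type": "text",
                "text": document,
                "cache_control": {"type": "ephemeral"},  # cacheable prefix
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": (
                    "Here is the chunk we want to place within the context "
                    f"of the full document.\n\n{chunk}\n\n"
                    "Provide a brief context that situates this chunk within "
                    "the overall document. Respond only with the concise context."
                ),
            }
        ],
    }

req = build_context_request("[full cookbook text]", "Heat the mixture slowly.")
```

Only the user message changes from chunk to chunk, which is exactly what makes the cached document prefix pay off.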
On My Mind
Contextual retrieval represents a simple yet powerful improvement to traditional RAG systems. By enriching each chunk with contextual text—pinpointing its semantic position within its source document—we dramatically reduce the ambiguity of each chunk and improve the quality of the information passed to the LLM. Combined with hybrid search, this technique lets us preserve semantics, keywords, and context simultaneously.