How to Build Agentic RAG with Hybrid Search
Source: Towards Data Science
Retrieval‑Augmented Generation (RAG) and Hybrid Search
RAG is a powerful technique for retrieving relevant documents from a corpus and feeding those chunks to a Large Language Model (LLM) to answer user queries. Hybrid search is a retrieval strategy used within RAG that combines keyword matching with vector similarity.
How Traditional RAG Works
- Vector similarity is used to locate semantically similar document chunks.
- The most relevant chunks are passed to the LLM, which generates a response.
This approach works well in many cases because semantic similarity can capture nuanced meanings.
When Vector Similarity Falls Short
- Users provide specific keywords, IDs, or exact phrases that must be matched verbatim.
- Pure semantic search may miss these exact matches, leading to incomplete or inaccurate answers.
Introducing Hybrid Search
Hybrid search combines:
- Keyword search (exact matching)
- Vector similarity (semantic matching)
By leveraging both methods, you can retrieve the most relevant chunks even when exact terms are required.
What You’ll Learn
- Why hybrid search improves RAG performance in keyword‑heavy scenarios.
- How to implement an agentic RAG system that dynamically decides between keyword and vector retrieval.
- Practical steps and code snippets to build the hybrid pipeline.
Visual Overview

Infographic summarizing the main contents of this article. Image by Gemini.
Why Use Hybrid Search?
Vector similarity is a powerful tool for retrieving relevant chunks from a corpus, even when the input prompt contains typos or synonyms (e.g., lift instead of elevator). However, it has notable limitations:
- Keyword Sensitivity: Vector models do not give special weight to individual words or identifiers. As a result, keywords or IDs can be “drowned out” by other semantically related terms, making it hard for pure semantic search to surface the most relevant documents.
- Exact Matching: When the user’s query includes specific terms—such as product codes, serial numbers, or unique names—vector similarity alone often fails to prioritize the matching documents.
Strengths of Keyword Search
Traditional keyword‑based methods (e.g., BM25) excel at:
- Exact Term Matching: If a word appears in only one document, that document receives a high relevance score when the term is present in the query.
- Identifier Retrieval: Unique IDs or codes are directly matched, ensuring the correct document surfaces.
Benefits of a Hybrid Approach
Combining vector similarity with keyword search gives you the best of both worlds:
- Broader Coverage – Semantic similarity captures relevant content even with paraphrasing or misspellings.
- Precise Retrieval – Keyword scoring ensures that exact terms and identifiers are not overlooked.
- Higher Relevance – The hybrid score balances the two signals, delivering results that are both contextually appropriate and precisely matched to the user’s intent.
Bottom line: Use hybrid search when you want to handle both fuzzy, semantic queries and exact keyword or identifier look‑ups, delivering more accurate and useful results for a wide range of user inputs.
How to Implement Hybrid Search
Hybrid search combines semantic (vector) similarity with keyword (BM25) similarity to improve retrieval quality. Below is a concise, step‑by‑step guide you can follow to build your own hybrid search system.
1. Set Up Vector Retrieval (Semantic Search)
- Encode your documents into dense vectors using a language model (e.g., OpenAI embeddings, Sentence‑Transformers, etc.).
- Store the vectors in a vector database (e.g., Pinecone, Weaviate, Milvus, or TurboPuffer).
- Query: Convert the user query into a vector and retrieve the top‑k most similar documents using cosine similarity or inner product.
The details of vector indexing are out of scope for this guide; any standard vector‑search pipeline will work.
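As a rough illustration of the query step, the sketch below ranks documents by cosine similarity in plain Python. The three-dimensional vectors are made up for the example and stand in for real embeddings, and `cosine` and `top_k_similar` are illustrative names rather than part of any library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec: list[float],
                  doc_vecs: dict[str, list[float]],
                  k: int = 2) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumption: a real system gets these from an embedding model).
doc_vecs = {
    "elevator_manual": [0.9, 0.1, 0.0],
    "lift_safety": [0.8, 0.2, 0.1],
    "coffee_recipes": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
results = top_k_similar(query_vec, doc_vecs, k=2)
```

A vector database performs the same ranking with an approximate nearest-neighbor index instead of a full scan, which is what makes it practical for large corpora.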
2. Add Keyword Retrieval (BM25)
- Index the same corpus with a BM25 implementation (e.g., Elasticsearch, Apache Lucene, Whoosh, or the `rank_bm25` Python library).
- Query: Run the raw text query against the BM25 index to obtain a relevance score for each document.
BM25 is preferred because it builds on TF-IDF and adds term-frequency saturation and document-length normalization, which makes its scores more robust. Any other keyword search algorithm can be substituted if you have a strong reason to do so.
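To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not the implementation used by any of the tools named above: the `TinyBM25` class, the whitespace tokenizer, the sample documents, and the defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration, and the IDF uses the common "+1 inside the log" variant so that scores never go negative.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class TinyBM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.n_docs = len(docs)
        self.avg_len = sum(len(t) for t in self.doc_tokens) / self.n_docs
        # Document frequency: number of documents that contain each term.
        self.df = Counter(term for toks in self.doc_tokens for term in set(toks))

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))

    def score(self, query: str, doc_index: int) -> float:
        freqs = self.doc_freqs[doc_index]
        doc_len = len(self.doc_tokens[doc_index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Saturating term frequency with document-length normalization.
            denom = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self.idf(term) * tf * (self.k1 + 1) / denom
        return total

    def scores(self, query: str) -> list[float]:
        return [self.score(query, i) for i in range(self.n_docs)]


docs = [
    "replacement filter for pump model xk-4412",
    "general pump maintenance guide",
    "how to choose a water filter",
]
bm25 = TinyBM25(docs)
scores = bm25.scores("xk-4412 filter")
```

The first document ranks highest because it contains the rare identifier `xk-4412`, which only one document mentions and which therefore carries a large IDF weight.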
3. Combine the Two Scores
Hybrid scoring is typically a weighted sum of the semantic and keyword scores:
\[ \text{HybridScore}(d) = \alpha \times \text{SemanticScore}(d) + (1-\alpha) \times \text{KeywordScore}(d) \]
| Parameter | Description | Typical Range |
|---|---|---|
| α (alpha) | Weight given to the semantic similarity component | 0.0 – 1.0 (e.g., 0.6) |
| SemanticScore | Cosine similarity (or inner product) between query vector and document vector, rescaled if needed | 0 – 1 |
| KeywordScore | BM25 relevance score, normalized (raw BM25 scores are unbounded) | 0 – 1 |
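The weighted sum only makes sense when both signals share a scale, so the sketch below rescales each result set before blending. Min-max normalization is one reasonable choice among several and is an assumption here, as are the function names and the sample scores.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat any positive score as a full match.
        return {doc_id: (1.0 if hi > 0 else 0.0) for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(semantic: dict[str, float],
                  keyword: dict[str, float],
                  alpha: float = 0.6) -> dict[str, float]:
    """alpha * SemanticScore + (1 - alpha) * KeywordScore on normalized scores."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    sem, kw = min_max(semantic), min_max(keyword)
    # A document missing from one result list contributes 0 for that signal.
    return {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in sorted(set(sem) | set(kw))
    }


def rank(scores: dict[str, float]) -> list[str]:
    """Document ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up raw scores: cosine similarities and unbounded BM25 scores.
semantic = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
keyword = {"doc_b": 7.1, "doc_c": 2.3}

blended = rank(hybrid_scores(semantic, keyword, alpha=0.6))
semantic_only = rank(hybrid_scores(semantic, keyword, alpha=1.0))
```

With α = 0.6, `doc_b` overtakes `doc_a` because it scores well on both signals, whereas with α = 1.0 the ranking falls back to the purely semantic order.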
Tips for choosing α
- Domain-specific tasks (e.g., legal or medical) often benefit from a higher keyword weight (a lower α) because exact term matching is crucial.
- Open-ended or conversational queries usually favor a higher semantic weight (a higher α).
- Dynamic weighting: If you have an LLM‑based agent, let it decide α on the fly based on the query intent (e.g., “look for exact phrase” → lower α).
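Where no LLM agent is available to pick α, a rule-based stand-in can approximate the same idea. The heuristic below is hypothetical: the regular expressions, the function name `choose_alpha`, and the values 0.2, 0.3, and 0.6 are illustrative choices, not recommendations from the article.

```python
import re

# Tokens mixing letters and digits (e.g. "XK-4412", "INV2024") often act as identifiers.
ID_PATTERN = re.compile(r"\b(?=[\w-]*\d)(?=[\w-]*[A-Za-z])[\w-]{4,}\b")
# A double-quoted span suggests the user wants an exact phrase.
QUOTED_PATTERN = re.compile(r'"[^"]+"')


def choose_alpha(query: str, default: float = 0.6) -> float:
    """Return the semantic weight alpha; lower values favor keyword search."""
    if QUOTED_PATTERN.search(query):
        return 0.2  # exact phrase requested: lean heavily on keyword matching
    if ID_PATTERN.search(query):
        return 0.3  # identifier-like token present: favor keyword matching
    return default  # otherwise rely mostly on semantic similarity


alpha_phrase = choose_alpha('logs containing "connection reset by peer"')
alpha_id = choose_alpha("manual for pump XK-4412")
alpha_open = choose_alpha("how do elevators work")
```

A heuristic like this is cheap and predictable, but it cannot judge intent the way an LLM can, so it is best treated as a fallback or a baseline to compare against.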
4. (Optional) Use Existing Packages
If you prefer not to build everything from scratch, several libraries already provide hybrid‑search utilities:
| Library | Vector Store | Keyword Search | Notes |
|---|---|---|---|
| TurboPuffer | Built-in vector storage | Built-in BM25 full-text search | Simple API for combining both modalities |
| Haystack | Multiple back‑ends (FAISS, Milvus, etc.) | Elasticsearch, OpenSearch | Offers a HybridRetriever out of the box |
| Vespa | Native vector and BM25 support | — | Scales to billions of documents |
Even when using a library, it’s valuable to understand the underlying mechanics—implementing the pipeline yourself helps you fine‑tune weighting, normalization, and ranking logic.
5. Evaluate and Iterate
- Create a test set with queries and known relevant documents.
- Run the hybrid pipeline and compute metrics such as Recall@k, NDCG, or MRR.
- Adjust α (or try non‑linear combinations) until you achieve the desired trade‑off between precision and recall.
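The metrics themselves are simple to compute. The following sketch implements Recall@k and MRR over a small, made-up test set; the function names and document ids are assumptions for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 3) -> dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Made-up results for two test queries.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # both relevant docs found, first hit at rank 1
    (["d9", "d2", "d5"], {"d2", "d8"}),  # one of two found, first hit at rank 2
]
metrics = evaluate(runs, k=3)
```

Running the same test set for several values of α and comparing the resulting metrics gives a simple, repeatable way to tune the weight.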
Summary
- Semantic search gives you contextual relevance; BM25 ensures exact term matching.
- Implement both independently, then merge their scores using a configurable weight (α).
- You can either hand‑craft the pipeline or leverage existing tools like TurboPuffer, Haystack, or Vespa.
- Proper evaluation is key to finding the optimal balance for your specific use case.
Hybrid search isn’t overly complex, and once you have both components in place, you’ll notice a tangible boost in retrieval quality with relatively little extra engineering effort. Happy searching!
Agentic Hybrid Search
Implementing hybrid search is a great way to boost the performance of your Retrieval‑Augmented Generation (RAG) system right out of the gate. However, if you really want to maximise the benefits of a hybrid‑search RAG pipeline, you should make it agentic.
What “agentic” means
A typical RAG flow works like this:
- Retrieve relevant document chunks (vector or keyword search).
- Feed those chunks to an LLM.
- Let the LLM generate the answer.
In an agentic RAG system, the retrieval step is exposed as a tool that the LLM can call on demand. Because the LLM now controls the retrieval, it can make several important decisions that improve answer quality.
Why an agentic approach is powerful
| Capability | How the agent helps | Why it matters |
|---|---|---|
| Prompt rewriting for retrieval | The LLM can rewrite the user query before sending it to the retriever. | Query rewriting is a proven technique for retrieving more relevant chunks. |
| Iterative fetching | The LLM can perform a first search, inspect the results, and decide whether to request more chunks. | Allows the model to verify it has enough context before answering, reducing hallucinations. |
| Dynamic weighting of hybrid components | The LLM decides the balance between keyword matching and vector similarity on a per‑query basis. | If the user includes a precise keyword, the model can boost the keyword‑search weight; otherwise it can rely more on semantic similarity. |
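The iterative-fetching behavior can be sketched as a small control loop. In a real system the sufficiency check and the query rewrite would be LLM calls; here they are scripted stubs so the example runs on its own, and every name, along with the two-entry corpus, is hypothetical.

```python
from typing import Callable


def agentic_retrieve(question: str,
                     search: Callable[[str, int], list[str]],
                     is_sufficient: Callable[[str, list[str]], bool],
                     rewrite: Callable[[str, list[str]], str],
                     max_rounds: int = 3,
                     top_k: int = 3) -> tuple[list[str], int]:
    """Search, inspect the results, and search again until the context suffices."""
    query = question
    collected: list[str] = []
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        for chunk in search(query, top_k):
            if chunk not in collected:
                collected.append(chunk)
        if is_sufficient(question, collected):
            break
        # Not enough context yet: let the "agent" reformulate the query.
        query = rewrite(question, collected)
    return collected, rounds_used


# --- Stubs standing in for the retriever and the LLM (assumptions) ---
corpus = {
    "pump xk-4412": ["XK-4412 uses filter F-20."],
    "filter f-20 replacement": ["Replace F-20 every 6 months."],
}


def search(query: str, top_k: int) -> list[str]:
    return corpus.get(query.lower(), [])[:top_k]


def is_sufficient(question: str, chunks: list[str]) -> bool:
    return any("every" in chunk for chunk in chunks)


def rewrite(question: str, chunks: list[str]) -> str:
    return "filter F-20 replacement"


chunks, rounds = agentic_retrieve("pump XK-4412", search, is_sufficient, rewrite)
```

The `max_rounds` cap matters in practice: without it, a model that never judges the context sufficient would keep searching indefinitely.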
How to make retrieval a tool
```python
# Pseudo-code for an LLM-driven retrieval tool.
# `Document`, `bm25_search`, `embedding_search`, and `blend_results` are
# placeholders for your own types and retrieval helpers.
from typing import List


def hybrid_search_tool(query: str,
                       weight_keyword: float = 0.5,
                       weight_vector: float = 0.5,
                       top_k: int = 5) -> List["Document"]:
    """
    • `query` – the (possibly rewritten) search string.
    • `weight_keyword` / `weight_vector` – dynamic blend of BM25 and embedding scores.
    • `top_k` – number of chunks to return.
    Returns a list of the most relevant document chunks.
    """
    # 1️⃣ Keyword search (e.g., BM25)
    kw_results = bm25_search(query, k=top_k)
    # 2️⃣ Vector search (e.g., FAISS / HNSW)
    vec_results = embedding_search(query, k=top_k)
    # 3️⃣ Combine scores using the supplied weights
    combined = blend_results(kw_results, vec_results,
                             w_kw=weight_keyword, w_vec=weight_vector)
    return combined[:top_k]
```

The LLM can call `hybrid_search_tool` repeatedly, adjusting `query`, `weight_keyword`, `weight_vector`, and `top_k` each time.
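For a model to call the tool, function-calling APIs generally expect a JSON-Schema-style description of its parameters, plus code that validates the arguments the model produces. The sketch below is generic and not tied to any provider: the layout of `TOOL_SPEC`, the `dispatch_tool_call` helper, the clamping limits, and the stubbed `hybrid_search_tool` (which only echoes its arguments) are all assumptions.

```python
import json

TOOL_SPEC = {
    "name": "hybrid_search_tool",
    "description": "Search the corpus with a blend of keyword (BM25) and vector similarity.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "weight_keyword": {"type": "number", "minimum": 0, "maximum": 1},
            "weight_vector": {"type": "number", "minimum": 0, "maximum": 1},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 20},
        },
        "required": ["query"],
    },
}


def hybrid_search_tool(query, weight_keyword=0.5, weight_vector=0.5, top_k=5):
    """Stub standing in for the real tool: echoes the arguments it received."""
    return {"query": query, "weight_keyword": weight_keyword,
            "weight_vector": weight_vector, "top_k": top_k}


def dispatch_tool_call(arguments_json: str):
    """Validate model-produced arguments, then call the tool."""
    args = json.loads(arguments_json)
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    # Clamp weights to [0, 1]; default the vector weight to the complement.
    w_kw = min(max(float(args.get("weight_keyword", 0.5)), 0.0), 1.0)
    w_vec = min(max(float(args.get("weight_vector", 1.0 - w_kw)), 0.0), 1.0)
    total = w_kw + w_vec
    if total == 0:
        w_kw = w_vec = 0.5
    else:
        w_kw, w_vec = w_kw / total, w_vec / total  # normalize so the weights sum to 1
    top_k = min(max(int(args.get("top_k", 5)), 1), 20)
    return hybrid_search_tool(query, w_kw, w_vec, top_k)


call = dispatch_tool_call('{"query": "XK-4412 filter", "weight_keyword": 0.8, "top_k": 50}')
```

Validating and clamping on the application side keeps a malformed or overly aggressive tool call, such as the `top_k` of 50 above, from reaching the retrieval backend unchecked.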
Why this works now
Frontier LLMs (e.g., GPT-4 Turbo, Claude 3, Gemini 1.5) have become sophisticated enough to:
- Understand when a query is keyword‑heavy vs. concept‑heavy.
- Rewrite prompts to surface the most useful embeddings.
- Reason about the sufficiency of retrieved context and decide whether more information is needed.
A few months ago, giving an LLM that level of autonomy would have been risky. Today, the models are reliable enough that this dynamic, tool‑driven approach is not only feasible but recommended.
Bottom line
- Implement hybrid search (keyword + vector).
- Expose retrieval as a callable tool for the LLM.
- Let the LLM decide:
  - how to phrase the search query,
  - how many and which chunks to fetch, and
  - how to weight keyword vs. semantic similarity.
By combining hybrid retrieval with an agentic LLM, you can super‑charge your RAG system and achieve far better results than a static, vector‑only pipeline.
Conclusion
In this article I discussed how to implement hybrid search in your RAG system and how to make your RAG pipeline more agentic for significantly better results. Combining these two techniques can dramatically boost the performance of your information-retrieval stack, and it can be implemented quite easily with coding agents such as Claude Code. I believe agentic systems represent the future of information retrieval, and I encourage you to equip your agents with hybrid-search capabilities so they can handle the heavy lifting for you.