How to Build Agentic RAG with Hybrid Search

Published: (March 13, 2026 at 08:00 AM EDT)
Source: Towards Data Science

Retrieval‑Augmented Generation (RAG) is a powerful technique for retrieving relevant documents from a corpus and feeding those chunks to a Large Language Model (LLM) to answer user queries. Hybrid search is a retrieval strategy that makes RAG more robust.

How Traditional RAG Works

  1. Vector similarity is used to locate semantically similar document chunks.
  2. The most relevant chunks are passed to the LLM, which generates a response.

This approach works well in many cases because semantic similarity can capture nuanced meanings.

When Vector Similarity Falls Short

  • Users provide specific keywords, IDs, or exact phrases that must be matched verbatim.
  • Pure semantic search may miss these exact matches, leading to incomplete or inaccurate answers.

Hybrid search combines:

  • Keyword search (exact matching)
  • Vector similarity (semantic matching)

By leveraging both methods, you can retrieve the most relevant chunks even when exact terms are required.

What You’ll Learn

  • Why hybrid search improves RAG performance in keyword‑heavy scenarios.
  • How to implement an agentic RAG system that dynamically decides between keyword and vector retrieval.
  • Practical steps and code snippets to build the hybrid pipeline.

Visual Overview

Infographic summarizing the main contents of this article. Image by Gemini.

Vector similarity is a powerful tool for retrieving relevant chunks from a corpus, even when the input prompt contains typos or synonyms (e.g., lift instead of elevator). However, it has notable limitations:

  • Keyword Sensitivity: Vector models do not give special weight to individual words or identifiers. As a result, keywords or IDs can be “drowned out” by other semantically related terms, making it hard for pure semantic search to surface the most relevant documents.
  • Exact Matching: When the user’s query includes specific terms—such as product codes, serial numbers, or unique names—vector similarity alone often fails to prioritize the matching documents.

Traditional keyword‑based methods (e.g., BM25) excel at:

  • Exact Term Matching: If a word appears in only one document, that document receives a high relevance score when the term is present in the query.
  • Identifier Retrieval: Unique IDs or codes are directly matched, ensuring the correct document surfaces.

Benefits of a Hybrid Approach

Combining vector similarity with keyword search gives you the best of both worlds:

  1. Broader Coverage – Semantic similarity captures relevant content even with paraphrasing or misspellings.
  2. Precise Retrieval – Keyword scoring ensures that exact terms and identifiers are not overlooked.
  3. Higher Relevance – The hybrid score balances the two signals, delivering results that are both contextually appropriate and precisely matched to the user’s intent.

Bottom line: Use hybrid search when you want to handle both fuzzy, semantic queries and exact keyword or identifier look‑ups, delivering more accurate and useful results for a wide range of user inputs.

Hybrid search combines semantic (vector) similarity with keyword (BM25) similarity to improve retrieval quality. Below is a concise, step‑by‑step guide you can follow to build your own hybrid search system.

1. Set Up Vector Retrieval

  1. Encode your documents into dense vectors using a language model (e.g., OpenAI embeddings, Sentence‑Transformers, etc.).
  2. Store the vectors in a vector database (e.g., Pinecone, Weaviate, Milvus, or TurboPuffer).
  3. Query: Convert the user query into a vector and retrieve the top‑k most similar documents using cosine similarity or inner product.

The details of vector indexing are out of scope for this guide; any standard vector‑search pipeline will work.
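To make the query step concrete, here is a minimal sketch of top‑k retrieval by cosine similarity. It assumes the embeddings already exist: the three‑dimensional vectors and document IDs below are made up for illustration, and a real system would get them from an embedding model and store them in a vector database.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query_vec: List[float],
                  doc_vecs: Dict[str, List[float]],
                  k: int = 3) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-written "embeddings" (an assumption, not real model output).
DOC_VECS = {
    "elevator-manual": [0.9, 0.1, 0.0],
    "lift-safety": [0.8, 0.2, 0.1],
    "coffee-recipes": [0.0, 0.1, 0.9],
}
QUERY_VEC = [1.0, 0.0, 0.0]

top = vector_search(QUERY_VEC, DOC_VECS, k=2)
for doc_id, score in top:
    print(f"{doc_id}: {score:.3f}")
```

This brute‑force scan is fine for small corpora. Vector databases replace it with approximate nearest‑neighbour indexes so that the query does not have to touch every document.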

2. Add Keyword Retrieval (BM25)

  1. Index the same corpus with a BM25 implementation (e.g., Elasticsearch, Apache Lucene, Whoosh, or the rank_bm25 Python library).
  2. Query: Run the raw text query against the BM25 index to obtain a relevance score for each document.

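BM25 is a sensible default because it builds on TF‑IDF and adds term‑frequency saturation and document‑length normalization, which keeps long documents and repeated terms from dominating the ranking. Any other keyword search algorithm can be substituted if you have a strong reason to do so.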
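If you want to see what a BM25 scorer does under the hood, the following is a small pure‑Python sketch of the Okapi BM25 formula. The whitespace tokenizer, the sample documents, and the parameter defaults (k1 = 1.5, b = 0.75) are assumptions chosen for illustration. In production you would normally rely on one of the implementations listed above.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, docs: Dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Return an Okapi BM25 score for every document in `docs`."""
    if not docs:
        return {}
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[doc_id] = score
    return scores


DOCS = {
    "invoice-a": "invoice for order sku-4471 shipped to berlin",
    "invoice-b": "invoice for order sku-9020 shipped to madrid",
    "faq": "how to track an order after it has shipped",
}
scores = bm25_scores("sku-4471 invoice", DOCS)
for doc_id, score in sorted(scores.items(), key=lambda p: p[1], reverse=True):
    print(f"{doc_id}: {score:.3f}")
```

The identifier sku-4471 appears in only one document, so it carries a high IDF and pushes that document to the top, which is the behaviour pure vector search tends to miss.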

3. Combine the Two Scores

Hybrid scoring is typically a weighted sum of the semantic and keyword scores:

\[ \text{HybridScore}(d) = \alpha \times \text{SemanticScore}(d) + (1-\alpha) \times \text{KeywordScore}(d) \]

| Parameter | Description | Typical Range |
| --- | --- | --- |
| α (alpha) | Weight given to the semantic similarity component | 0.0 – 1.0 (e.g., 0.6) |
| SemanticScore | Cosine similarity (or inner product) between query vector and document vector, rescaled if needed | 0 – 1 |
| KeywordScore | BM25 relevance score (raw scores are unbounded, so normalize them) | 0 – 1 |
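The weighted sum only makes sense when both scores live on the same scale. One possible way to get there is sketched below: min‑max normalize each score set over the candidate documents, then blend with α. Min‑max scaling is an assumption on my part (other normalizations work too), and the scores in the example are invented.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the 0-1 range over the candidate set."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie, so this signal carries no ranking information.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(semantic: Dict[str, float],
                  keyword: Dict[str, float],
                  alpha: float = 0.6) -> Dict[str, float]:
    """alpha * SemanticScore + (1 - alpha) * KeywordScore per document."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    sem = min_max_normalize(semantic)
    kw = min_max_normalize(keyword)
    # A document missing from one retriever contributes 0 for that signal.
    doc_ids = set(sem) | set(kw)
    return {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
            for d in doc_ids}


# Invented scores: cosine similarities and raw BM25 values.
semantic = {"a": 0.82, "b": 0.80, "c": 0.40}
keyword = {"a": 0.0, "b": 7.5, "c": 1.5}

blended = hybrid_scores(semantic, keyword, alpha=0.6)
best = max(blended, key=blended.get)
print(best, round(blended[best], 3))
```

Document "a" wins on semantic similarity alone, but "b" is nearly as close semantically and is the only strong keyword match, so it comes out on top after blending.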

Tips for choosing α

  • Domain‑specific tasks (e.g., legal or medical) often benefit from a higher keyword weight because exact term matching is crucial.
  • Open‑ended or conversational queries usually favor a higher semantic weight.
  • Dynamic weighting: If you have an LLM‑based agent, let it decide α on the fly based on the query intent (e.g., “look for exact phrase” → lower α).
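As a rough illustration of query‑dependent weighting, the sketch below picks α with a simple rule: if the query contains a quoted phrase or a token that looks like an identifier, lean on keywords. This is a hypothetical fallback heuristic, not the LLM‑driven decision described above. The function names and thresholds are my own assumptions.

```python
def looks_like_identifier(token: str) -> bool:
    """Heuristic: tokens mixing letters and digits (e.g. SKU-4471)."""
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    return has_digit and has_alpha and len(token) >= 4


def choose_alpha(query: str,
                 default: float = 0.6,
                 keyword_heavy: float = 0.3) -> float:
    """Return the semantic weight (alpha) to use for this query."""
    # A quoted phrase signals that the user wants a verbatim match.
    if query.count('"') >= 2:
        return keyword_heavy
    tokens = [t.strip(".,;:!?()'") for t in query.split()]
    if any(looks_like_identifier(t) for t in tokens):
        return keyword_heavy
    return default


for q in ["invoice for SKU-4471",
          "how do lifts work",
          'find the "force majeure" clause']:
    print(f"{q!r} -> alpha={choose_alpha(q)}")
```

A rule like this will misfire on some queries, which is one reason to hand the decision to an agent once you have one.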

4. (Optional) Use Existing Packages

If you prefer not to build everything from scratch, several libraries already provide hybrid‑search utilities:

| Library | Vector Store | Keyword Search | Notes |
| --- | --- | --- | --- |
| TurboPuffer | Built‑in vector storage | Keyword search (BM25) | Simple API for combining both modalities |
| Haystack | Multiple back‑ends (FAISS, Milvus, etc.) | Elasticsearch, OpenSearch | Supports hybrid retrieval pipelines out of the box |
| Vespa | Native vector support | Native BM25 support | Scales to billions of documents |

Even when using a library, it’s valuable to understand the underlying mechanics—implementing the pipeline yourself helps you fine‑tune weighting, normalization, and ranking logic.

5. Evaluate and Iterate

  1. Create a test set with queries and known relevant documents.
  2. Run the hybrid pipeline and compute metrics such as Recall@k, NDCG, or MRR.
  3. Adjust α (or try non‑linear combinations) until you achieve the desired trade‑off between precision and recall.
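The evaluation loop can be sketched in a few lines. The example below computes Recall@k and MRR over a tiny, invented test set. The query IDs, document IDs, and rankings are placeholders for the output of your own pipeline.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: Dict[str, List[str]],
             judgments: Dict[str, Set[str]],
             k: int = 3) -> Dict[str, float]:
    """Average both metrics over every query that has relevance judgments."""
    recalls = [recall_at_k(runs[q], judgments[q], k) for q in judgments]
    rrs = [reciprocal_rank(runs[q], judgments[q]) for q in judgments]
    return {"recall@k": sum(recalls) / len(recalls),
            "mrr": sum(rrs) / len(rrs)}


# Invented test set: known relevant documents per query.
judgments = {"q1": {"d1", "d4"}, "q2": {"d2"}}
# Invented rankings, as a hybrid pipeline might return them.
runs = {"q1": ["d3", "d1", "d5", "d4"], "q2": ["d2", "d7", "d8"]}

metrics = evaluate(runs, judgments, k=3)
print(metrics)
```

To tune α, you could run the pipeline once per candidate value (say 0.0 to 1.0 in steps of 0.1), call evaluate on each run, and keep the value with the best metric on a held‑out set of queries.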

Summary

  • Semantic search gives you contextual relevance; BM25 ensures exact term matching.
  • Implement both independently, then merge their scores using a configurable weight (α).
  • You can either hand‑craft the pipeline or leverage existing tools like TurboPuffer, Haystack, or Vespa.
  • Proper evaluation is key to finding the optimal balance for your specific use case.

Hybrid search isn’t overly complex, and once you have both components in place, you’ll notice a tangible boost in retrieval quality with relatively little extra engineering effort. Happy searching!

Implementing hybrid search is a great way to boost the performance of your Retrieval‑Augmented Generation (RAG) system right out of the gate. However, if you really want to maximise the benefits of a hybrid‑search RAG pipeline, you should make it agentic.

What “agentic” means

A typical RAG flow works like this:

  1. Retrieve relevant document chunks (vector or keyword search).
  2. Feed those chunks to an LLM.
  3. Let the LLM generate the answer.

In an agentic RAG system, the retrieval step is exposed as a tool that the LLM can call on demand. Because the LLM now controls the retrieval, it can make several important decisions that improve answer quality.

Why an agentic approach is powerful

| Capability | How the agent helps | Why it matters |
| --- | --- | --- |
| Prompt rewriting for retrieval | The LLM can rewrite the user query before sending it to the vector store. | Query rewriting is a proven technique for retrieving more relevant chunks. |
| Iterative fetching | The LLM can perform a first search, inspect the results, and decide whether to request more chunks. | Allows the model to verify it has enough context before answering, reducing hallucinations. |
| Dynamic weighting of hybrid components | The LLM decides the balance between keyword matching and vector similarity on a per‑query basis. | If the user includes a precise keyword, the model can boost the keyword‑search weight; otherwise it can rely more on semantic similarity. |

How to make retrieval a tool

# Sketch of an LLM-driven retrieval tool
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """A retrieved chunk; the fields shown here are illustrative."""
    doc_id: str
    text: str
    score: float = 0.0


def hybrid_search_tool(query: str,
                       weight_keyword: float = 0.5,
                       weight_vector: float = 0.5,
                       top_k: int = 5) -> List[Document]:
    """
    query: the (possibly rewritten) search string.
    weight_keyword / weight_vector: dynamic blend of BM25 and embedding scores.
    top_k: number of chunks to return.
    Returns a list of the most relevant document chunks.
    """
    # bm25_search, embedding_search and blend_results are placeholders for
    # your own retrieval back-ends and score-merging logic.

    # 1. Keyword search (e.g., BM25)
    kw_results = bm25_search(query, k=top_k)

    # 2. Vector search (e.g., FAISS / HNSW)
    vec_results = embedding_search(query, k=top_k)

    # 3. Combine scores using the supplied weights
    combined = blend_results(kw_results, vec_results,
                             w_kw=weight_keyword, w_vec=weight_vector)

    return combined[:top_k]

The LLM can call hybrid_search_tool repeatedly, adjusting query, weight_keyword, weight_vector, and top_k each time.
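For the LLM to call the tool, it needs a machine‑readable description of the parameters, and your code needs to validate whatever arguments come back. The sketch below shows one way to do both. The JSON‑Schema‑style layout is modelled on common function‑calling conventions, but the exact envelope differs between providers, so treat the field names as an assumption and check your provider's documentation. The clamping limits and the fake_search stub are illustrative as well.

```python
import json
from typing import Any, Callable, Dict, List

# Tool description handed to the LLM (layout is an assumption; adapt per provider).
HYBRID_SEARCH_TOOL_SPEC: Dict[str, Any] = {
    "name": "hybrid_search_tool",
    "description": "Search the corpus with a blend of BM25 and vector similarity.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "weight_keyword": {"type": "number", "minimum": 0, "maximum": 1},
            "weight_vector": {"type": "number", "minimum": 0, "maximum": 1},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 20},
        },
        "required": ["query"],
    },
}


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def dispatch_tool_call(raw_arguments: str,
                       search_fn: Callable[..., List[str]]) -> List[str]:
    """Parse the model's JSON arguments, sanitize them, and run the search."""
    args = json.loads(raw_arguments)
    for name in HYBRID_SEARCH_TOOL_SPEC["parameters"]["required"]:
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    # Never trust model-supplied numbers blindly: clamp them to safe ranges.
    return search_fn(
        str(args["query"]),
        weight_keyword=clamp(float(args.get("weight_keyword", 0.5)), 0.0, 1.0),
        weight_vector=clamp(float(args.get("weight_vector", 0.5)), 0.0, 1.0),
        top_k=int(clamp(int(args.get("top_k", 5)), 1, 20)),
    )


recorded: Dict[str, Any] = {}


def fake_search(query: str, weight_keyword: float,
                weight_vector: float, top_k: int) -> List[str]:
    """Stand-in for hybrid_search_tool that records what it was called with."""
    recorded.update(query=query, weight_keyword=weight_keyword,
                    weight_vector=weight_vector, top_k=top_k)
    return [f"chunk for {query}"]


result = dispatch_tool_call(
    '{"query": "SKU-4471", "weight_keyword": 1.7, "top_k": 500}', fake_search)
print(result, recorded)
```

Clamping out‑of‑range values instead of rejecting them keeps the agent loop moving when the model is slightly off. Raising an error and returning it to the model is a reasonable alternative.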

Why this works now

Frontier LLMs (e.g., GPT‑4‑Turbo, Claude‑3, Gemini‑1.5) have become sophisticated enough to:

  • Understand when a query is keyword‑heavy vs. concept‑heavy.
  • Rewrite queries so that retrieval surfaces more useful chunks.
  • Reason about the sufficiency of retrieved context and decide whether more information is needed.

A few months ago, giving an LLM that level of autonomy would have been risky. Today, the models are reliable enough that this dynamic, tool‑driven approach is not only feasible but recommended.

Bottom line

  1. Implement hybrid search (keyword + vector).
  2. Expose retrieval as a callable tool for the LLM.
  3. Let the LLM decide
    • how to phrase the search query,
    • how many and which chunks to fetch, and
    • how to weight keyword vs. semantic similarity.
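The control flow behind these three steps can be sketched without any LLM at all. In the example below, scripted_decide is a hard‑coded stand‑in for the model and fake_search is a stand‑in for the hybrid retriever. Both are hypothetical, and in a real system the decision would come from a tool‑calling LLM. What the sketch shows is the loop itself: search, inspect the context, search again or answer, with a step limit as a safety net.

```python
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]


def run_agent(question: str,
              decide: Callable[[str, List[str]], Action],
              search: Callable[..., List[str]],
              max_steps: int = 3) -> str:
    """Let `decide` alternate between searching and answering."""
    context: List[str] = []
    for _ in range(max_steps):
        action = decide(question, context)
        if action["type"] == "answer":
            return str(action["text"])
        chunks = search(action["query"],
                        weight_keyword=action["weight_keyword"],
                        weight_vector=action["weight_vector"],
                        top_k=action["top_k"])
        # Keep only chunks we have not seen yet.
        context.extend(c for c in chunks if c not in context)
    return "I could not find enough information to answer."


def scripted_decide(question: str, context: List[str]) -> Action:
    """Hard-coded stand-in for the LLM's decisions."""
    if not context:
        # Exact identifier in the question: lean on keyword matching.
        return {"type": "search", "query": "SKU-4471",
                "weight_keyword": 0.8, "weight_vector": 0.2, "top_k": 2}
    if not any("warranty" in chunk for chunk in context):
        # Context is insufficient: rewrite the query and search semantically.
        return {"type": "search", "query": "warranty period for pumps",
                "weight_keyword": 0.2, "weight_vector": 0.8, "top_k": 2}
    return {"type": "answer", "text": f"Answer based on {len(context)} chunks."}


def fake_search(query: str, weight_keyword: float,
                weight_vector: float, top_k: int) -> List[str]:
    """Stand-in retriever backed by a tiny lookup table."""
    corpus = {
        "SKU-4471": ["SKU-4471 is a pump."],
        "warranty period for pumps": ["Pumps carry a 2-year warranty."],
    }
    return corpus.get(query, [])[:top_k]


answer = run_agent("What is the warranty on SKU-4471?",
                   scripted_decide, fake_search)
print(answer)
```

The max_steps limit matters in practice: without it, a model that never decides it has enough context would keep searching indefinitely.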

By combining hybrid retrieval with an agentic LLM, you can super‑charge your RAG system and achieve far better results than a static, vector‑only pipeline.

Conclusion

In this article, I discussed how to implement hybrid search in your RAG system and how to make your RAG pipeline agentic for significantly better results. Combining these two techniques can substantially improve the retrieval quality of your stack, and coding agents such as Claude Code make the implementation straightforward. I believe agentic systems represent the future of information retrieval, and I encourage you to equip your agents with hybrid‑search capabilities so they can handle the heavy lifting for you.
