Vectorized Thinking: Building Production-Ready RAG Pipelines with Elasticsearch

Published: 11 hours ago (March 3, 2026 at 04:22 AM EST)

8 min read

Source: Dev.to

Abstract

While traditional keyword‑based search has served us for decades, it often fails to grasp the nuances of human intent in the era of Generative AI. In this guide we explore the shift toward Vectorized Thinking. We will implement a complete Retrieval‑Augmented Generation (RAG) pipeline using the Elasticsearch Relevance Engine (ESRE) and OpenAI embeddings, demonstrating how to bridge the gap between lexical matching and semantic understanding. By the end of this article you will understand how to build, optimize, and deploy a RAG system that is both accurate and scalable.

1. The “Semantic Gap” in Keyword Search

Traditional search engines rely on lexical matching, typically using the BM25 algorithm. While BM25 is excellent for finding exact terms, it is fundamentally blind to meaning. This creates what we call the Semantic Gap.

The Problem – Imagine a user asking a support bot, “How do I recover my account?” If your knowledge base only contains the phrase “Reset your password using the Forgot Password option,” a standard keyword search might fail. Why? Because the words recover and account do not appear in the target document.

This gap leads to hallucinations in LLMs (Large Language Models) because, without the right context, the model is forced to guess. Vector search solves this by representing intent as mathematical coordinates in a high‑dimensional space, allowing the system to “understand” that recovery and resetting are semantically identical in this context.

2. What Is Vector Search?

Vector search converts unstructured text into dense numerical representations called embeddings. These embeddings map words, sentences, or entire documents into a high‑dimensional space where “Account Recovery” and “Password Reset” are geographically close.

Key Concepts for the Elastic Stack

Concept	Description
Dense Vector Embeddings	Fixed‑length arrays (e.g., 1536 dimensions for OpenAI’s `text‑embedding‑3‑small`) that act as a digital fingerprint for meaning.
Cosine Similarity	Measures the angle between two vectors; a smaller angle indicates higher semantic similarity.
HNSW (Hierarchical Navigable Small World)	High‑performance indexing algorithm used by Elasticsearch. It builds a multi‑layered graph that finds the nearest neighbors in milliseconds, skipping billions of irrelevant documents. Think of it as a “skip list” for multi‑dimensional space.

3. System Architecture

A production‑grade RAG pipeline isn’t just a single script; it’s a lifecycle consisting of two distinct loops. Understanding this flow is critical for building reliable GenAI applications.

[Diagram: The RAG Lifecycle]

Ingestion Loop (Offline)

Raw Documents → Chunking Service → OpenAI Embedder → Elasticsearch Index (Vector Store)

In this phase we prepare our knowledge base by turning text into searchable vectors.

Inference Loop (Online)

User Query → OpenAI Embedder → kNN Vector Search in Elastic → Context Injection → LLM Response

In this phase we use the user’s query to find the best context before asking the LLM for an answer.

4. Implementation Guide with Elastic Cloud

To follow this guide you need an Elastic Cloud instance (managed environment that includes the Elasticsearch Relevance Engine (ESRE), which simplifies integration of external model providers like OpenAI).

Step 1: Defining the Vector Schema

PUT /rag-index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "metadata": {
        "type": "keyword"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}

Technical note: m defines the number of bi‑directional links for each new element (higher values improve accuracy but increase indexing time). ef_construction controls the size of the dynamic list used during graph construction.

Step 2: Intelligent Chunking & Ingestion

Embedding whole articles leads to semantic dilution. The “Goldilocks” zone is 500‑800 tokens per chunk with a 10 % overlap to preserve context across chunk boundaries.

def chunk_text(text, limit=500, overlap=50):
    """
    Split `text` into overlapping chunks.
    - `limit`  : maximum number of tokens per chunk
    - `overlap`: number of tokens to overlap between consecutive chunks
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), limit - overlap):
        chunks.append(" ".join(words[i:i + limit]))
    return chunks

Ingestion Loop Example

for chunk in chunk_text(raw_document):
    # Generate embedding via OpenAI
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk
    )
    vector = response.data[0].embedding

    # Index into Elasticsearch
    es.index(
        index="rag-index",
        document={
            "text": chunk,
            "embedding": vector,
            "metadata": {"source": "kb"}   # optional metadata
        }
    )

Step 3: Semantic Retrieval (kNN)

We search for the vector of the user’s intent, not for raw text. The num_candidates parameter tells Elasticsearch how many results to consider across the HNSW graph layers.

search_response = es.search(
    index="rag-index",
    knn={
        "field": "embedding",
        "query_vector": user_query_vector,
        "k": 3,                # top‑k results to return
        "num_candidates": 100 # candidates examined in the graph
    }
)

5. Advanced Optimization: Hybrid Search & Reciprocal Rank Fusion (RRF)

While pure vector search is powerful, combining it with traditional lexical search often yields the best of both worlds.

Hybrid Search – Simultaneously query BM25 (or another lexical scorer) and the vector field, then merge the scores.
Reciprocal Rank Fusion (RRF) – A simple, score‑agnostic method that combines two ranked lists by rewarding items that appear high in both.

POST /rag-index/_search
{
  "size": 5,
  "query": {
    "bool": {
      "should": [
        {
          "knn": {
            "embedding": {
              "vector": user_query_vector,
              "k": 5
            }
          }
        },
        {
          "match": {
            "text": {
              "query": user_query_text,
              "boost": 2.0
            }
          }
        }
      ]
    }
  }
}

After retrieving the hybrid results, apply RRF in your application layer to produce a final ranking.

6. Deploying the RAG Service

Containerize the ingestion and inference scripts (Docker).
Expose a lightweight HTTP endpoint (FastAPI / Flask) that:
- Receives a user query.
- Calls OpenAI to embed the query.
- Executes the kNN search.
- Formats the top‑k chunks as context.
- Sends the context + query to the LLM (e.g., gpt‑4o).
Scale the service with Kubernetes or Elastic Cloud’s built‑in autoscaling.

7. Monitoring & Maintenance

Metric	Why It Matters
Indexing latency	Ensures ingestion keeps up with source updates.
kNN query latency	Guarantees a responsive user experience (< 200 ms typical).
Embedding cost	Track OpenAI token usage to control expenses.
LLM hallucination rate	Periodically audit responses; high hallucination indicates insufficient context.

Set up alerts in Elastic Observability (APM, Logs, Metrics) to stay ahead of regressions.

8. Weakness: Exact Term Matching

If a user searches for a specific part number like “SKU‑9904‑X,” a pure vector search may return “similar” parts instead of the exact one.

Solution – Hybrid Search with Reciprocal Rank Fusion (RRF)

RRF lets you combine the results of a BM25 keyword search and a k‑NN vector search into a single, unified ranking.

GET /rag-index/_search
{
  "query": {
    "match": {
      "text": "SKU-9904-X"
    }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [0.12, 0.45, ...],
    "k": 10,
    "num_candidates": 100
  },
  "rank": {
    "rrf": {}
  }
}

By merging these two methods you get the “best of both worlds” – the precision of keyword matching and the intuition of semantic search.

9. Production Considerations: “Lessons from the Trenches”

Building a Retrieval‑Augmented Generation (RAG) pipeline in production requires more than just logic; it demands infrastructure awareness.

Quantization (Scalar Quantization)

Vector storage is RAM‑intensive.
Elasticsearch supports int8 quantization, compressing vectors from 32‑bit floats to 8‑bit integers.
In practice this saved ≈75 % of memory with < 1 % drop in retrieval accuracy.

Circuit Breakers

Your embedding provider (OpenAI, Anthropic, etc.) is a third‑party dependency.
Implement exponential backoff and circuit breakers.
If the embedder is down, gracefully degrade to keyword‑only search instead of crashing.

The Reranker Pattern

For high‑stakes applications, use a two‑stage retrieval process:
1. Elasticsearch returns the top 50 documents.
2. A cross‑encoder model (e.g., Cohere Rerank) selects the final top 3.
This significantly improves precision.

10. Observations & Performance

Testing this architecture on a dataset of 50 000 technical documents yielded:

Metric	Result
Accuracy	40 % reduction in LLM “hallucinations.” Grounding the model in retrieved facts made answers more factual and concise.
Latency	The “Semantic Hop” (calling the embedding API) adds ≈150 ms per query. For latency‑sensitive apps, cache embeddings for frequent queries.

Conclusion

Vectorized thinking shifts focus from keywords to intent. By leveraging the Elasticsearch Relevance Engine, developers can build search experiences that truly understand the user. Whether you’re building a customer‑support bot or a complex research tool, the combination of HNSW indexing, Hybrid Search, and LLM augmentation provides a solid foundation for the next generation of AI‑driven applications.

Vectorized Thinking: Building Production-Ready RAG Pipelines with Elasticsearch

Abstract

1. The “Semantic Gap” in Keyword Search

2. What Is Vector Search?

Key Concepts for the Elastic Stack

3. System Architecture

Ingestion Loop (Offline)

Inference Loop (Online)

4. Implementation Guide with Elastic Cloud

Step 1: Defining the Vector Schema

Step 2: Intelligent Chunking & Ingestion

Ingestion Loop Example

Step 3: Semantic Retrieval (kNN)

5. Advanced Optimization: Hybrid Search & Reciprocal Rank Fusion (RRF)

6. Deploying the RAG Service

7. Monitoring & Maintenance

8. Weakness: Exact Term Matching

Solution – Hybrid Search with Reciprocal Rank Fusion (RRF)

9. Production Considerations: “Lessons from the Trenches”

Quantization (Scalar Quantization)

Circuit Breakers

The Reranker Pattern

10. Observations & Performance

Conclusion

Related posts

Shared Workflows: minha experiência definindo pipelines reutilizáveis

Building a Local-First Financial IDE: How I forced Gemini AI to do strict Double-Entry Accounting

I ran cursor-doctor on 50 real projects. Here's what broke.

Google Gemini Writing Challenge

Abstract

1. The “Semantic Gap” in Keyword Search

2. What Is Vector Search?

Key Concepts for the Elastic Stack

3. System Architecture

Ingestion Loop (Offline)

Inference Loop (Online)

4. Implementation Guide with Elastic Cloud

Step 1: Defining the Vector Schema

Step 2: Intelligent Chunking & Ingestion

Ingestion Loop Example

Step 3: Semantic Retrieval (kNN)

5. Advanced Optimization: Hybrid Search & Reciprocal Rank Fusion (RRF)

6. Deploying the RAG Service

7. Monitoring & Maintenance

8. Weakness: Exact Term Matching

Solution – Hybrid Search with Reciprocal Rank Fusion (RRF)

9. Production Considerations: “Lessons from the Trenches”

Quantization (Scalar Quantization)

Circuit Breakers

The Reranker Pattern

10. Observations & Performance

Conclusion

Related posts

Shared Workflows: minha experiência definindo pipelines reutilizáveis

Building a Local-First Financial IDE: How I forced Gemini AI to do strict Double-Entry Accounting

I ran cursor-doctor on 50 real projects. Here's what broke.

Google Gemini Writing Challenge

Step 1: Defining the Vector Schema

Step 2: Intelligent Chunking & Ingestion

Step 3: Semantic Retrieval (kNN)