Retrieval-Augmented Generation: Connecting LLMs to Your Data

Published: December 6, 2025 at 10:00 PM EST
4 min read
Source: Dev.to

Tech Acronyms Reference

| Acronym | Meaning |
| --- | --- |
| API | Application Programming Interface |
| BERT | Bidirectional Encoder Representations from Transformers |
| FAISS | Facebook AI Similarity Search |
| GPU | Graphics Processing Unit |
| JSON | JavaScript Object Notation |
| LLM | Large Language Model |
| RAG | Retrieval‑Augmented Generation |
| ROI | Return on Investment |
| SQL | Structured Query Language |
| VRAM | Video Random Access Memory |

Why LLMs Need External Data

Large Language Models (LLMs) have a fundamental limitation: their knowledge is frozen at training time.

Ask GPT‑4 about:

  • “What did our Q3 sales look like?” → ❌ Doesn’t know your data
  • “What’s in our employee handbook?” → ❌ Doesn’t have your docs
  • “Show me tickets from yesterday” → ❌ No real‑time access
  • “What did the customer say in ticket #45632?” → ❌ Can’t see your database

The LLM has no knowledge of your specific data.

Solutions Overview

| Approach | How it works | Trade-offs |
| --- | --- | --- |
| Fine‑tuning | Tailors the model to your data | Expensive, slow, static |
| Long context | Simple, prompt‑only: put everything in the context | Limited by the context window, costly |
| Retrieval‑Augmented Generation (RAG) | Retrieves relevant data, then generates | Flexible, scalable, cost‑effective |

This article focuses on RAG, the most practical approach for production systems.

What Is Retrieval‑Augmented Generation (RAG)?

RAG connects LLMs to proprietary data at scale. It consists of three stages:

  1. Indexing (offline) – Process documents into vector embeddings and store them in a vector database.
  2. Retrieval (query time) – Embed the user query, search the vector store, and return the top‑k most relevant chunks.
  3. Generation – Feed the retrieved chunks plus the original query to the LLM to produce a final answer.

Real‑Life Analogy: The Research Assistant

| Stage | What the assistant does |
| --- | --- |
| Indexing | Reads all company documents, creates organized notes, and files them for quick retrieval. |
| Retrieval | When you ask a question, searches the notes and pulls out the most relevant documents. |
| Generation | Reads the retrieved documents, formulates an answer, and responds. |

RAG Workflow Diagram

┌─────────────────────────────────────────────────────────┐
│                    INDEXING (Offline)                    │
├─────────────────────────────────────────────────────────┤
│ Documents → Chunking → Embeddings → Vector Database       │
│ "handbook.pdf" → paragraphs → vector representations      │
│ "policies.docx" → paragraphs → vector representations      │
│ "faqs.md"      → paragraphs → vector representations      │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                  RETRIEVAL (Query Time)                  │
├─────────────────────────────────────────────────────────┤
│ User Query → Embed Query → Search Vector DB → Top‑K      │
│ "What's the return policy?" → vector → find similar chunks │
│ → return 5 most relevant chunks                           │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                  GENERATION (Response)                   │
├─────────────────────────────────────────────────────────┤
│ Retrieved Docs + Query → LLM → Final Answer               │
│ Context: [5 relevant chunks about returns]                │
│ Question: "What is the return policy?"                    │
│ LLM Output: "Our return policy allows returns within 30   │
│ days of purchase. Items must be in original condition..." │
└─────────────────────────────────────────────────────────┘
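
To make the Generation box above concrete, here is one minimal way to assemble the retrieved chunks and the user question into a single prompt (the template wording is just an illustrative choice, not a library API):

from typing import List

def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine retrieved chunks and the user question into one LLM prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )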

Installation

pip install langchain
pip install chromadb      # Vector database
pip install sentence-transformers  # Embeddings
pip install litellm      # LLM interface
pip install pypdf        # PDF processing

Python Example: Loading and Chunking Documents

from typing import List
import re

def load_documents(file_paths: List[str]) -> List[str]:
    """Load plain‑text documents from a list of file paths."""
    documents = []
    for path in file_paths:
        with open(path, "r", encoding="utf-8") as f:
            documents.append(f.read())
    return documents
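
load_documents above handles plain‑text files. Since pypdf is installed as part of the stack, a PDF such as the handbook.pdf shown in the diagram can be read with a small helper like this (a sketch; load_pdf is just a helper name used here):

from pypdf import PdfReader

def load_pdf(path: str) -> str:
    """Extract text from every page of a PDF and join it into one string."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)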

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Split `text` into overlapping chunks.

    Parameters
    ----------
    text : str
        Input text to chunk.
    chunk_size : int, default 500
        Target size of each chunk (characters).
    overlap : int, default 50
        Number of characters to overlap between consecutive chunks.
    """
    # Simple sentence‑aware chunking
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks = []
    current_chunk = []
    current_len = 0

    for sentence in sentences:
        if current_len + len(sentence) > chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))

            # Preserve overlap for the next chunk
            overlap_sentences = []
            overlap_len = 0
            for s in reversed(current_chunk):
                if overlap_len + len(s) < overlap:
                    overlap_sentences.insert(0, s)
                    overlap_len += len(s)
                else:
                    break

            current_chunk = overlap_sentences
            current_len = overlap_len

        current_chunk.append(sentence)
        current_len += len(sentence)

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

Python Example: Storing and Retrieving with ChromaDB

With chunks in hand, embed them and store them in a vector database. The small VectorStore class below wraps a ChromaDB collection together with a sentence-transformers embedding model (all-MiniLM-L6-v2 is used here as a lightweight default; any sentence-transformers model works).

import chromadb
from sentence_transformers import SentenceTransformer

class VectorStore:
    """Minimal wrapper around a ChromaDB collection and an embedding model."""

    def __init__(self, collection_name: str = "documents",
                 model_name: str = "all-MiniLM-L6-v2"):
        # In-memory client; switch to chromadb.PersistentClient(path=...) to persist to disk.
        self.client = chromadb.Client()
        self.collection = self.client.get_or_create_collection(name=collection_name)
        self.embedding_model = SentenceTransformer(model_name)

    def add_documents(self, chunks: List[str], source: str) -> None:
        """Embed `chunks` and store them, tagging each with its source file."""
        embeddings = self.embedding_model.encode(chunks).tolist()
        self.collection.add(
            ids=[f"{source}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[{"source": source} for _ in chunks],
        )

    def query(self, query_text: str, top_k: int = 5) -> List[dict]:
        """
        Retrieve the `top_k` most similar chunks for `query_text`.

        Returns
        -------
        List[dict] with keys `id`, `document`, `metadata`, `distance`.
        """
        query_emb = self.embedding_model.encode([query_text]).tolist()
        results = self.collection.query(
            query_embeddings=query_emb,
            n_results=top_k,
            include=["documents", "metadatas", "distances"],  # ids are always returned by ChromaDB
        )
        # Re‑format results for easier consumption
        hits = []
        for i in range(len(results["ids"][0])):
            hits.append({
                "id": results["ids"][0][i],
                "document": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "distance": results["distances"][0][i],
            })
        return hits

You can now combine the chunking logic with VectorStore to build a full RAG pipeline:

  1. Load raw documents.
  2. Chunk them with chunk_text.
  3. Insert the chunks into VectorStore.
  4. At query time, embed the user question, retrieve the top‑k chunks, and pass the concatenated context plus the original question to your LLM (e.g., via litellm or langchain), as sketched below.
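
Putting it all together, here is a minimal end-to-end sketch of those four steps. It reuses the build_prompt helper sketched earlier, assumes an API key for your LLM provider is set in the environment, and uses gpt-4o-mini purely as an example LiteLLM model id; the file names are placeholders.

from litellm import completion

# Steps 1-3: index the documents (offline)
store = VectorStore()
for path in ["handbook.txt", "policies.txt"]:   # placeholder file names
    for doc in load_documents([path]):
        store.add_documents(chunk_text(doc), source=path)

# Step 4: retrieve and generate (query time)
question = "What is the return policy?"
hits = store.query(question, top_k=5)
prompt = build_prompt(question, [hit["document"] for hit in hits])

response = completion(
    model="gpt-4o-mini",  # any LiteLLM-supported model id works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)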
