Retrieval-Augmented Generation: Connecting LLMs to Your Data
Source: Dev.to
Tech Acronyms Reference
| Acronym | Meaning |
|---|---|
| API | Application Programming Interface |
| BERT | Bidirectional Encoder Representations from Transformers |
| FAISS | Facebook AI Similarity Search |
| GPU | Graphics Processing Unit |
| JSON | JavaScript Object Notation |
| LLM | Large Language Model |
| RAG | Retrieval‑Augmented Generation |
| ROI | Return on Investment |
| SQL | Structured Query Language |
| VRAM | Video Random Access Memory |
Why LLMs Need External Data
Large Language Models (LLMs) have a fundamental limitation: their knowledge is frozen at training time.
Ask GPT‑4 about:
- “What did our Q3 sales look like?” → ❌ Doesn’t know your data
- “What’s in our employee handbook?” → ❌ Doesn’t have your docs
- “Show me tickets from yesterday” → ❌ No real‑time access
- “What did the customer say in ticket #45632?” → ❌ Can’t see your database
The LLM has no knowledge of your specific data.
Solutions Overview
| Approach | Pros | Cons |
|---|---|---|
| Fine‑tuning | Tailors model to your data | Expensive, slow, static |
| Long context | Simple prompt‑only solution | Limited by context window, costly |
| Retrieval‑Augmented Generation (RAG) | Flexible, scalable, cost‑effective; retrieves only the relevant data at query time | Requires extra retrieval infrastructure (embeddings, vector store) |
This article focuses on RAG, the most practical approach for production systems.
What Is Retrieval‑Augmented Generation (RAG)?
RAG connects LLMs to proprietary data at scale. It consists of three stages (sketched in code right after the list):
- Indexing (offline) – Process documents into vector embeddings and store them in a vector database.
- Retrieval (query time) – Embed the user query, search the vector store, and return the top‑k most relevant chunks.
- Generation – Feed the retrieved chunks plus the original query to the LLM to produce a final answer.
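To make the flow concrete before introducing real tooling, here is a deliberately tiny, self-contained sketch of the three stages. The letter-count "embedding" and the in-memory list are stand-ins invented purely for illustration; the rest of the article replaces them with a real embedding model and a vector database.

```python
from typing import List, Tuple

# Toy "embedding": letter-frequency vector. A stand-in for a real model,
# used only to illustrate the indexing -> retrieval -> generation flow.
def embed(text: str) -> List[float]:
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def similarity(a: List[float], b: List[float]) -> float:
    # Dot product as a crude similarity score.
    return sum(x * y for x, y in zip(a, b))

# 1. Indexing (offline): embed document chunks and keep the vectors.
chunks = [
    "Returns are accepted within 30 days of purchase.",
    "Standard shipping takes 3-5 business days.",
]
index: List[Tuple[str, List[float]]] = [(c, embed(c)) for c in chunks]

# 2. Retrieval (query time): embed the query and rank chunks by similarity.
query = "What is the return policy?"
q_vec = embed(query)
best_chunk = max(index, key=lambda item: similarity(q_vec, item[1]))[0]

# 3. Generation: in a real system, the retrieved context plus the query
#    would be sent to an LLM; here we just print the assembled prompt.
print(f"Context: {best_chunk}\n\nQuestion: {query}")
```

The point is only the shape of the pipeline: build an index once, search it for every query, then hand the best matches to the generator.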
Real‑Life Analogy: The Research Assistant
| Stage | What the assistant does |
|---|---|
| Indexing | Reads all company documents, creates organized notes, files them for quick retrieval. |
| Retrieval | When you ask a question, searches the notes and pulls out the most relevant documents. |
| Generation | Reads the retrieved documents, formulates an answer, and responds. |
RAG Workflow Diagram
```
┌─────────────────────────────────────────────────────────┐
│ INDEXING (Offline) │
├─────────────────────────────────────────────────────────┤
│ Documents → Chunking → Embeddings → Vector Database │
│ "handbook.pdf" → paragraphs → vector representations │
│ "policies.docx" → paragraphs → vector representations │
│ "faqs.md" → paragraphs → vector representations │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ RETRIEVAL (Query Time) │
├─────────────────────────────────────────────────────────┤
│ User Query → Embed Query → Search Vector DB → Top‑K │
│ "What's the return policy?" → vector → find similar chunks │
│ → return 5 most relevant chunks │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ GENERATION (Response) │
├─────────────────────────────────────────────────────────┤
│ Retrieved Docs + Query → LLM → Final Answer │
│ Context: [5 relevant chunks about returns] │
│ Question: "What is the return policy?" │
│ LLM Output: "Our return policy allows returns within 30 │
│ days of purchase. Items must be in original condition..." │
└─────────────────────────────────────────────────────────┘
```
Installation
```bash
pip install langchain
pip install chromadb               # Vector database
pip install sentence-transformers  # Embeddings
pip install litellm                # LLM interface
pip install pypdf                  # PDF processing
```
Python Example: Loading and Chunking Documents
```python
from typing import List
import re


def load_documents(file_paths: List[str]) -> List[str]:
    """Load plain‑text documents from a list of file paths."""
    documents = []
    for path in file_paths:
        with open(path, "r", encoding="utf-8") as f:
            documents.append(f.read())
    return documents

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Split `text` into overlapping chunks.

    Parameters
    ----------
    text : str
        Input text to chunk.
    chunk_size : int, default 500
        Target size of each chunk (characters).
    overlap : int, default 50
        Number of characters to overlap between consecutive chunks.
    """
    # Simple sentence‑aware chunking
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks = []
    current_chunk = []
    current_len = 0

    for sentence in sentences:
        if current_len + len(sentence) > chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            # Preserve overlap for the next chunk
            overlap_sentences = []
            overlap_len = 0
            for s in reversed(current_chunk):
                if overlap_len + len(s) > overlap:
                    break
                overlap_sentences.insert(0, s)
                overlap_len += len(s)
            current_chunk = overlap_sentences
            current_len = overlap_len
        current_chunk.append(sentence)
        current_len += len(sentence)

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
Python Example: Storing and Querying Chunks in a Vector Database
The chunks are embedded with a sentence-transformers model and stored in ChromaDB. The wrapper below assumes an in-memory ChromaDB client and the `all-MiniLM-L6-v2` embedding model; both are common defaults and can be swapped for your own choices.
```python
import uuid
from typing import List

import chromadb
from sentence_transformers import SentenceTransformer


class VectorStore:
    """Thin wrapper around a ChromaDB collection plus an embedding model."""

    def __init__(self, collection_name: str = "documents",
                 model_name: str = "all-MiniLM-L6-v2"):
        # In-memory client; use chromadb.PersistentClient(...) to keep data on disk.
        self.client = chromadb.Client()
        self.collection = self.client.get_or_create_collection(name=collection_name)
        self.embedding_model = SentenceTransformer(model_name)

    def add_documents(self, chunks: List[str]) -> None:
        """Embed `chunks` and store them under freshly generated ids."""
        # No metadata is attached here; pass `metadatas=` to `collection.add` if needed.
        embeddings = self.embedding_model.encode(chunks).tolist()
        ids = [str(uuid.uuid4()) for _ in chunks]
        self.collection.add(ids=ids, documents=chunks, embeddings=embeddings)

    def query(self, query_text: str, top_k: int = 5) -> List[dict]:
"""
Retrieve the `top_k` most similar chunks for `query_text`.
Returns
-------
List[dict] with keys `id`, `document`, `metadata`, `distance`.
"""
query_emb = self.embedding_model.encode([query_text]).tolist()
results = self.collection.query(
query_embeddings=query_emb,
n_results=top_k,
include=["documents", "metadatas", "distances", "ids"]
)
# Re‑format results for easier consumption
hits = []
for i in range(len(results["ids"][0])):
hits.append({
"id": results["ids"][0][i],
"document": results["documents"][0][i],
"metadata": results["metadatas"][0][i],
"distance": results["distances"][0][i],
})
return hits
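For illustration, a quick round trip through the store could look like the following; the chunk texts and the query are made up, and the exact distances you see will depend on the embedding model.

```python
store = VectorStore()
store.add_documents([
    "Returns are accepted within 30 days of purchase.",
    "Items must be unused and in their original packaging.",
    "Refunds are issued to the original payment method.",
])

# Lower distance = more similar under ChromaDB's default metric.
for hit in store.query("What is the return policy?", top_k=2):
    print(f"{hit['distance']:.3f}  {hit['document']}")
```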
You can now combine the chunking logic with `VectorStore` to build a full RAG pipeline (a sketch of these steps follows the list):
- Load raw documents.
- Chunk them with `chunk_text`.
- Insert the chunks into `VectorStore`.
- At query time, embed the user question, retrieve the top‑k chunks, and pass the concatenated context plus the original question to your LLM (e.g., via `litellm` or `langchain`).