RAG in the Wild: What I Learned After Two Weeks of Chunking Experiments
Source: Dev.to
Three months ago I shipped a RAG pipeline that I was genuinely proud of: semantic search over our internal docs, OpenAI embeddings, Pinecone on the backend. It felt modern. Then someone on our team asked it “what’s our parental leave policy?” and it returned a confident three‑paragraph answer that was completely fabricated—stitched together from an old HR doc, a Confluence page about PTO, and what I can only assume was vibes.
That was my wake‑up call. The embedding model wasn’t broken. The vector DB wasn’t broken. The retrieval step—the part I had basically copy‑pasted from a tutorial and moved on—was the problem. I spent the next two weeks obsessively fixing it, and this is what I found.
Your Chunk Size Is Probably Wrong (Mine Was)
Most tutorials tell you to chunk at 512 tokens and call it a day. I did that. It worked okay for short factual lookups but fell apart the moment a question required synthesizing information across a longer document—like, say, a policy that spans three sections with cross‑references.
| Chunk Size | Pros | Cons |
|---|---|---|
| Small chunks (≈512 tokens) | Higher retrieval precision (the relevant sentence actually makes it into the top‑k results) | Context is stripped, hurting answer quality |
| Large chunks (≈1,200 tokens) | Keeps context, useful for synthesis | Lower precision; the relevant info may be buried deep inside a large blob |
I ran a controlled experiment on our documentation corpus (≈800 documents, mix of Markdown and PDFs). Three strategies:
- Fixed‑size chunking – 512 tokens with 50‑token overlap. Baseline. Easy to implement, predictable performance. Also where I started.
- Semantic chunking – split on sentence boundaries, then group sentences until a semantic shift is detected (measured by cosine distance between consecutive sentence embeddings). I used LangChain’s `SemanticChunker` (LangChain v0.2.x). This produced chunks ranging from 80 to 600 tokens depending on document structure.
- Hierarchical / parent‑document retrieval – store small chunks for retrieval, but when a chunk is retrieved, return its larger parent chunk to the LLM. This is the one that actually moved the needle.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Child chunks — what gets embedded and searched
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Parent chunks — what the LLM actually sees
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800)

# Any LangChain vector store works; Chroma here just for illustration
vectorstore = Chroma(
    collection_name="docs", embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index child chunks but store parent chunks
# (docs: your list of LangChain Document objects, loaded elsewhere)
retriever.add_documents(docs)

# At query time, retrieval happens on child embeddings,
# but the returned context is the full parent chunk
results = retriever.invoke("what is the parental leave policy?")
```
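For intuition, the semantic‑chunking strategy above can be sketched without LangChain: group consecutive sentences until the embedding of the next sentence drifts too far from the previous one. This is a toy version; `embed` stands in for a real sentence‑embedding model (my assumption, not part of the original pipeline), and the distance threshold needs tuning per corpus, which is what `SemanticChunker` automates.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def semantic_chunks(sentences, embed, threshold=0.3):
    """Group consecutive sentences; start a new chunk when the next
    sentence's embedding drifts past the threshold."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine_distance(prev_vec, vec) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

With a real model, `embed` would be a sentence‑transformers model or an embeddings API call rather than anything hand‑rolled.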
Results (100 manually‑labeled QA pairs)
| Strategy | Accuracy |
|---|---|
| Fixed‑size (512 tok) | 61 % |
| Semantic chunking | 68 % |
| Parent‑document retrieval | 79 % |
Takeaway: Start with a fixed‑size baseline (≈512 tokens). Then try parent‑document retrieval before investing in semantic chunking. The complexity‑to‑benefit ratio on semantic chunking was disappointing for most real‑world corpora.
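The fixed‑size baseline itself is only a few lines. A word‑level sketch (the experiment counted tokens via the tokenizer, so treat the units here as illustrative):

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Slide a window of chunk_size tokens, stepping by
    chunk_size - overlap so consecutive chunks share context."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

In practice `tokens` should come from the embedding model’s own tokenizer (e.g. tiktoken for OpenAI models) so chunk sizes line up with the model’s limits.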
Picking a Vector Database Without Losing Your Mind
I tested four options: Pinecone, Qdrant, Weaviate, and pgvector. My setup was a single‑node deployment for a team of ~30 people—not a million‑user product—so take the performance numbers with appropriate context.
| DB | Pros | Cons |
|---|---|---|
| Pinecone | Fully managed, clean Python SDK, zero‑ops infra | Metadata filtering has gotchas (cardinality limits); pricing penalises large metadata payloads; early‑2025 bug with high‑cardinality string filters |
| Qdrant | Open source, Docker‑ready, expressive query API, hybrid (sparse + dense) search, Rust core = fast | Docs have gaps; async Python client a bit rough (v1.7) |
| Weaviate | Built‑in BM25, native hybrid search, GraphQL interface; great for multi‑modal retrieval | Large surface area; overkill for simple RAG pipelines |
| pgvector | Leverages existing Postgres, HNSW index gives decent latency up to a few hundred k vectors | Not tested beyond that scale; may need tuning for larger corpora |
My recommendation
- Fully managed & no hybrid search needed → Pinecone
- Control + hybrid search out of the box → Qdrant
- Already on Postgres with ≤ 500 k chunks → pgvector (a legitimate first‑class choice, not just a fallback)
The Retrieval Step Is Where Most RAG Pipelines Leave Performance on the Table
Basic vector search—embed the query, find the nearest neighbors, feed the top‑k chunks to the LLM—is a reasonable starting point. It’s also where people stop, and it shows.
Hybrid search (sparse + dense) made a surprisingly large difference. Dense embeddings capture semantics but struggle with exact keyword matches — product names, error codes, specific version strings. Sparse retrieval (BM25) nails those. Combining them with reciprocal rank fusion (RRF) gives you the best of both.
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Assuming you've set up a collection with both dense and sparse vectors,
# and computed dense_embedding / sparse_indices / sparse_values for the query
results = client.query_points(
    collection_name="docs",
    prefetch=[
        # Dense vector search (semantic)
        models.Prefetch(
            query=dense_embedding,  # your query embedding
            using="dense",
            limit=20,
        ),
        # Sparse vector search (BM25‑style)
        models.Prefetch(
            query=models.SparseVector(
                indices=sparse_indices,
                values=sparse_values,
            ),
            using="sparse",
            limit=20,
        ),
    ],
    # RRF fusion happens here
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
)
```
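If your vector store doesn’t support fusion natively, RRF is simple enough to implement by hand. A minimal sketch (k=60 is the constant commonly used in RRF implementations):

```python
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists (e.g. one from dense search,
    one from sparse). Each appearance contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# A doc ranked decently by both retrievers beats one that tops only a single list
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
```

The appeal of RRF is that it only needs ranks, not scores, so you never have to normalize dense similarities against sparse BM25 scores.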
Hybrid search helped most on technical documentation with product‑specific terminology. On more conversational or policy‑style content, the improvement was modest. If your corpus is dense with jargon or version numbers, it’s worth the implementation overhead.
Reranking is the other lever that moved things significantly. After your initial retrieval (say, top‑20 chunks), run a cross‑encoder reranker to reorder them before passing to the LLM. The intuition: bi‑encoders (what you use for initial retrieval) encode query and document independently for speed. Cross‑encoders look at query + document jointly and are much more accurate — just too slow to run at retrieval scale, which is why you do it on the reduced candidate set.
I used cross-encoder/ms-marco-MiniLM-L-6-v2 from HuggingFace. It added about 80 ms to latency on a CPU for reranking 20 candidates, which was acceptable for us. Cohere’s Rerank API is the managed alternative — I haven’t used it in production but have heard good things.
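The retrieve‑then‑rerank shape is easy to get right once you see it as two stages. Here is a sketch with the scorer stubbed out; in production `score_fn` would wrap the cross‑encoder (with sentence‑transformers, roughly `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` over query–doc pairs), but the stub keeps the example self‑contained:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Stage 2: re-score the small candidate set with an expensive
    scorer that sees (query, doc) jointly, then truncate."""
    scored = sorted(
        candidates,
        key=lambda doc: score_fn(query, doc),
        reverse=True,
    )
    return scored[:keep]
```

The design point is that the expensive scorer only ever sees the top‑20 candidates from stage 1, which is why an 80 ms reranking pass is affordable while cross‑encoding the whole corpus would not be.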
The MMR gotcha: I added Maximal Marginal Relevance to reduce redundancy in retrieved chunks, thinking it would help. For some queries it did, but it also filtered out a chunk containing the exact relevant detail because a more general chunk was ranked higher and deemed “too similar.” My recall numbers actually dropped. I ended up disabling MMR and addressing redundancy through chunking strategy instead. Don’t assume it’s free until you’ve tested it on your specific dataset.
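For reference, this is roughly what MMR does under the hood, which makes the failure mode visible: a highly relevant chunk can lose out because it looks too similar to something already selected. A sketch where `query_sim` and `doc_sim` stand in for cosine similarity on embeddings (both are placeholders of mine, not a library API):

```python
def mmr(query_sim, doc_sim, doc_ids, lam=0.5, keep=5):
    """Greedily pick docs maximizing relevance minus redundancy:
    lam * sim(query, d) - (1 - lam) * max sim(d, already-selected)."""
    selected = []
    remaining = list(doc_ids)
    while remaining and len(selected) < keep:
        best = max(
            remaining,
            key=lambda d: lam * query_sim(d)
            - (1 - lam) * max((doc_sim(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

Run this on a corpus where the second‑most‑relevant doc is a near‑duplicate of the first and you can watch it get skipped in favor of a less relevant but “diverse” one, which is exactly the recall drop described above.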
- If your corpus is keyword‑heavy, implement hybrid search.
- If retrieval quality still feels off after that, add a reranker — it’s often the single highest‑ROI improvement available.
- Be skeptical of MMR.
Evaluating Whether Any of This Actually Helps
Nobody talks about this enough: you need an eval harness before you start tuning, or you’re flying blind. I built mine with ragas (v0.1.x) and ~100 manually curated QA pairs from our actual documentation.
Four metrics I tracked
- Faithfulness — does the answer stick to what’s in the retrieved context?
- Answer relevancy — is the answer actually responsive to the question?
- Context precision — are the retrieved chunks relevant?
- Context recall — is the relevant information making it into the retrieved chunks at all?
My initial pipeline had fine faithfulness (the LLM wasn’t hallucinating beyond what was in the retrieved docs) but terrible context recall — I was only surfacing the relevant chunk ~60 % of the time. That’s why the parental‑leave answer was wrong: the relevant doc wasn’t making it into the top‑5 results. Once I identified that, the fix was obvious — better chunking plus hybrid search to catch “parental leave” as a keyword match.
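Context recall is also the easiest of the four metrics to approximate by hand before wiring up ragas: if each QA pair records the id of the chunk that contains the answer, hit‑rate@k is a couple of lines (names here are illustrative, not from ragas):

```python
def hit_rate_at_k(qa_pairs, retrieve, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k
    retrieved ids -- a cheap proxy for context recall."""
    hits = sum(
        1
        for question, gold_chunk_id in qa_pairs
        if gold_chunk_id in retrieve(question)[:k]
    )
    return hits / len(qa_pairs)
```

A number like this is what surfaced the ~60 % recall problem in the first place: it tells you whether to fix retrieval or the prompt before you touch either.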
Without the eval setup I would have kept tweaking the prompt. That’s the trap.
What I’d Actually Build With Today
Vector store
- Start with pgvector if the team is already on Postgres. It removes an infrastructure dependency and is plenty capable for most internal tools.
- Migrate to Qdrant once you hit scale issues or need hybrid search badly; the data migration is not that painful.
Embeddings
- `text-embedding-3-large` from OpenAI (3072 dims) or `nomic-embed-text` for a solid open‑source option.
I’m not convinced the latest embedding models are worth the cost premium over text-embedding-3-large for most RAG use cases — though I haven’t benchmarked the most recent releases.
Retrieval strategy
- Parent‑document retrieval over semantic chunking: simpler to implement, easier to debug, better performance in my tests.
- Hybrid search from day one if you control your vector‑DB choice. BM25 is not dead.
Reranking
- One cross‑encoder reranker pass before sending context to the LLM. The latency cost is worth it.
Evaluation
- Build your eval harness before anything else. Even 50 QA pairs is enough to tell you whether your changes are helping or hurting. Without it you’re just iterating on vibes, and I spent two weeks learning that the hard way.