The Missing Step in RAG: Why Your Vector DB is Bloated (and how to fix it locally)
Introduction
We spend countless hours optimizing LLM prompts, tweaking retrieval parameters (k‑NN), and choosing the best embedding models. Yet we often ignore the elephant in the room: data quality.
If you are building a RAG (Retrieval‑Augmented Generation) pipeline using internal company data—logs, tickets, documentation, or emails—you have likely encountered the Semantic Duplicate Problem.
The Problem: Different Words, Same Meaning
Standard deduplication tools (e.g., pandas.DataFrame.drop_duplicates() or SQL DISTINCT) work on a string level and look for exact matches.
Example log entries:
Error: Connection to database timed out after 3000ms.
DB Connection Failure: Timeout limit reached (3s).
To a standard script, these are two unique rows.
To an LLM (and to a human), they are identical.
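To see that failure concretely, here is a minimal pandas sketch (the `message` column name is made up for illustration): exact-match deduplication keeps both rows because the strings differ.

```python
import pandas as pd

# Two log lines that mean the same thing but differ textually.
df = pd.DataFrame({
    "message": [
        "Error: Connection to database timed out after 3000ms.",
        "DB Connection Failure: Timeout limit reached (3s).",
    ]
})

# String-level deduplication: both rows survive.
print(len(df.drop_duplicates(subset="message")))  # -> 2
```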
When you ingest 10,000 such rows into a vector database (Pinecone, Milvus, Weaviate):
- 💸 Cost – Storing redundant vectors isn’t free.
- 📉 Retrieval quality – A user asking “Why did the DB fail?” receives multiple variations of the same error, crowding out other relevant information.
- 😵 Model hallucinations – Repetitive context degrades output quality.
The Solution: Semantic Deduplication
Deduplication must be based on meaning (vectors), not just syntax (text).
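Here is a minimal sketch of the idea, embedding the two log lines above with `sentence-transformers` and comparing them by cosine similarity (the model choice and the 0.85 threshold are illustrative, not requirements):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "Error: Connection to database timed out after 3000ms."
b = "DB Connection Failure: Timeout limit reached (3s)."

# Cosine similarity of the two embeddings; semantically close pairs
# score far higher than unrelated text.
score = util.cos_sim(model.encode(a), model.encode(b)).item()
print(f"cosine similarity: {score:.2f}")

# A pair scoring above your deduplication threshold (e.g., 0.85)
# would be treated as the same signal and collapsed into one row.
```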
I couldn’t find a lightweight, privacy‑first tool that runs locally without a Spark cluster or external API calls, so I built one: EntropyGuard.
EntropyGuard – A Local‑First ETL Engine
EntropyGuard is an open‑source CLI tool written in Python that sanitizes data before it reaches your vector database. It addresses three critical problems:
- Semantic Deduplication – Uses `sentence-transformers` and FAISS to find duplicates by cosine similarity.
- Sanitization – Strips PII (emails, phone numbers) and HTML noise.
- Privacy – Runs 100% locally on CPU; no data exfiltration.
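For a feel of what sanitization means in practice, here is a rough, regex-based sketch (these patterns are simplified illustrations, not EntropyGuard's actual rules):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
HTML_RE = re.compile(r"<[^>]+>")

def sanitize(text: str) -> str:
    """Strip HTML tags and mask obvious PII before embedding."""
    text = HTML_RE.sub(" ", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return " ".join(text.split())

print(sanitize("<p>Contact jane.doe@example.com or +1 555 123 4567</p>"))
# -> "Contact [EMAIL] or [PHONE]"
```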
Tech Stack (Hard Tech)
| Component | Choice | Reason |
|---|---|---|
| Engine | Polars LazyFrame | Streaming execution; can process a 10 GB CSV on a laptop with 16 GB RAM without loading everything into memory. |
| Vector Search | FAISS (Facebook AI Similarity Search) | Blazing‑fast CPU‑only vector comparisons. |
| Chunking | Native recursive chunker (paragraph → sentence) | Avoids the bloat of heavy frameworks like LangChain. |
| Ingestion | Excel (.xlsx), Parquet, CSV, JSONL | Supported natively. |
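To illustrate the engine choice: Polars' lazy API builds the whole query plan first and can execute it in streaming batches, so a file much larger than RAM is never fully materialized. A rough sketch of the pattern (file and column names are made up; this is not EntropyGuard's internal code):

```python
import polars as pl

# scan_csv builds a LazyFrame: nothing is loaded into memory yet.
lazy = (
    pl.scan_csv("raw_data.csv")           # hypothetical input file
      .filter(pl.col("message").is_not_null())
      .select(["id", "message"])          # keep only the columns we need
)

# sink_parquet executes the plan in streaming batches, writing results
# as they are produced instead of collecting everything into RAM.
lazy.sink_parquet("staged.parquet")
```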
How It Works (The Code)
Installation
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
Running an Audit (Dry Run)
entropyguard \
--input raw_data.jsonl \
--output clean_data.jsonl \
--dedup-threshold 0.85 \
--audit-log audit_report.json
The dry run generates a JSON audit log that shows exactly which rows would be dropped and why—crucial for compliance teams.
What Happens Under the Hood
- Embedding – Generates embeddings locally using a small model such as `all-MiniLM-L6-v2`.
- Clustering – Clusters embeddings with FAISS.
- Deduplication – Removes neighbors whose cosine similarity exceeds the specified threshold (e.g., 0.85).
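Put together, the pipeline looks roughly like this (a condensed sketch, not EntropyGuard's exact implementation; embeddings are normalized so FAISS's inner product equals cosine similarity):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

texts = [
    "Error: Connection to database timed out after 3000ms.",
    "DB Connection Failure: Timeout limit reached (3s).",
    "User login succeeded for account 42.",
]

# 1. Embedding – encode locally; normalized vectors make the FAISS
#    inner product equal to cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(texts, normalize_embeddings=True).astype(np.float32)

# 2. Neighbour search – exact, CPU-only FAISS index over all rows.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# 3. Deduplication – keep the first occurrence, drop later rows whose
#    similarity to a kept row exceeds the threshold.
THRESHOLD = 0.85
k = min(5, len(texts))
scores, ids = index.search(emb, k)

keep, dropped = [], set()
for i in range(len(texts)):
    if i in dropped:
        continue
    keep.append(i)
    for score, j in zip(scores[i], ids[i]):
        if int(j) > i and score >= THRESHOLD:
            dropped.add(int(j))

clean = [texts[i] for i in keep]
print(clean)  # near-duplicates above the threshold collapse into one entry
```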
Benchmark: 99.5% Noise Reduction
A stress test on a synthetic dataset of 10,000 rows (50 unique signals with heavy noise: HTML tags, rephrasing, typos) yielded:
- Raw Data: 10,000 rows
- Cleaned Data: ~50 rows
- Execution Time: (not specified)
I’m actively seeking feedback from the data‑engineering community. If you’re struggling with dirty RAG datasets, give EntropyGuard a spin and let me know how it works for you!