STOP GUESSING: The Observability Stack I Built to Debug My Failing AI Agents
Source: Dev.to

The RAG pipeline is a black box. I got tired of guessing why my bot retrieved the wrong context, so I built a stack for reliable, observable vector retrieval and semantic content verification.
RAG retrieval and LLM output verification are the new bottlenecks in AI development, and the missing piece is observability. My answers are MemVault (reliable hybrid vector retrieval) and ContextDiff (deterministic AI output verification).
Tool 1: MemVault – The Observable Memory Server
I built MemVault to solve the complex retrieval‑integrity problem. Setting up dedicated vector databases is overkill for many projects, so I designed MemVault as a robust, open‑source Node.js wrapper around the reliable stack we already use: PostgreSQL + pgvector.
Hybrid Search 2.0: The End of Guesswork
Most RAG pipelines use only semantic search, which is brittle. MemVault ensures reliability with a weighted 3‑way hybrid score (a minimal scoring sketch follows the table):
| Component | Technique | Weight |
|---|---|---|
| Semantic (Vector) | Cosine similarity via pgvector | 50 % |
| Exact Match (Keyword) | BM25 (Postgres tsvector) for IDs, error codes, etc. | 30 % |
| Recency (Time) | Decay function prioritising recent memories | 20 % |
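To make the weighting concrete, here is a minimal TypeScript sketch of how a 3‑way score like this could be combined. The function, the normalisation assumptions, and the exponential recency half‑life are my own illustration, not MemVault's actual internals:

```typescript
// Hypothetical sketch of a 3-way hybrid score (not MemVault's real internals).
// Assumes the semantic and keyword scores are already normalised to [0, 1].

interface ScoredCandidate {
  semanticScore: number; // cosine similarity from pgvector, normalised to [0, 1]
  keywordScore: number;  // BM25 / tsvector rank, normalised to [0, 1]
  ageInDays: number;     // how old the memory is
}

const WEIGHTS = { semantic: 0.5, keyword: 0.3, recency: 0.2 };
const RECENCY_HALF_LIFE_DAYS = 30; // illustrative decay constant

function hybridScore(c: ScoredCandidate): number {
  // Exponential decay: a 30-day-old memory scores ~0.5 on the recency axis.
  const recencyScore = Math.pow(0.5, c.ageInDays / RECENCY_HALF_LIFE_DAYS);
  return (
    WEIGHTS.semantic * c.semanticScore +
    WEIGHTS.keyword * c.keywordScore +
    WEIGHTS.recency * recencyScore
  );
}
```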
The Visualizer: Debugging in Real‑Time
MemVault offers a dashboard that visualises the vector search as it happens. You can instantly see why a specific document was retrieved and what its weighted score was.
Live demo: (link omitted in original)
Setup: Choose Your Economic Reality
- Self‑Host (MIT License) – Run the entire stack (Postgres + Ollama for embeddings) 100 % offline via Docker. Ideal for privacy and zero API costs.
- Managed API (RapidAPI) – Use the hosted service to skip maintenance and infrastructure setup (Free tier available).
Quick Start (NPM SDK)
```bash
npm install memvault-sdk-jakops88
```
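The snippet below is only a hypothetical usage sketch: the `MemVaultClient` class and its `store`/`search` methods are placeholder names showing the shape of a store‑then‑retrieve flow, so check the package README for the real SDK surface:

```typescript
// Hypothetical usage sketch: the client and method names are placeholders,
// not the documented memvault-sdk-jakops88 API.
import { MemVaultClient } from "memvault-sdk-jakops88";

async function main() {
  const memvault = new MemVaultClient({ apiKey: process.env.MEMVAULT_API_KEY });

  // Store a memory, then retrieve it via the hybrid (vector + keyword + recency) search.
  await memvault.store({ text: "Order #4821 failed with error E_TIMEOUT on 2024-03-02" });
  const results = await memvault.search({ query: "why did order 4821 fail?", limit: 5 });
  console.log(results);
}

main().catch(console.error);
```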
Tool 2: ContextDiff – Semantic Output Validation
If MemVault ensures you retrieve the right context, ContextDiff makes sure the LLM doesn’t ruin it.
Deterministic Semantic Verification
ContextDiff is a production‑ready FastAPI/Next.js monorepo that performs LLM‑powered comparison and returns a structured assessment (a rough sketch of its shape follows the list):
- Risk Scoring – Objective 0‑100 risk score with a safety determination.
- Change Detection – Flags specific change types with reasoning:
  - FACTUAL – Critical claims or certainty levels changed (e.g., “will” vs. “might”).
  - TONE – Sentiment or formality shifted.
  - OMISSION/ADDITION – Information was dropped or introduced.
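As a mental model, the structured assessment could look roughly like the TypeScript types below. The field names are illustrative guesses, not ContextDiff's actual response schema:

```typescript
// Illustrative shape of a ContextDiff-style assessment. Field names are my own
// guess at a schema for explanation purposes, not the actual API response.
type ChangeType = "FACTUAL" | "TONE" | "OMISSION" | "ADDITION";

interface DetectedChange {
  type: ChangeType;
  reasoning: string;   // why the change matters semantically
  original: string;    // span from the source text
  modified: string;    // corresponding span from the LLM output
}

interface DiffAssessment {
  riskScore: number;   // 0-100, higher = riskier
  safe: boolean;       // overall safety determination
  changes: DetectedChange[];
}
```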
Why Simple Diff Fails
Simple diff tools are useless for AI. ContextDiff detects that changing “Q1 2024” to “early 2024” is a semantic change in certainty (a risk), not just a string difference.
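Using those illustrative field names, a flagged change for that exact edit might read like this (hypothetical output, not a real API response):

```typescript
// How a semantic check might report the "Q1 2024" -> "early 2024" edit:
// a FACTUAL change in certainty, not just a string difference.
// (Entirely illustrative, reusing the field names sketched above.)
const flaggedChange = {
  type: "FACTUAL" as const,
  reasoning: "A committed quarter became a vague timeframe, weakening certainty.",
  original: "launching in Q1 2024",
  modified: "launching in early 2024",
};
console.log(flaggedChange);
```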
Use case: High‑stakes content validation (Legal, Medical, Finance) where maintaining the semantic integrity of the source is mandatory.
Demo: (link omitted in original)
Conclusion: Stop Debugging in the Dark
The future of reliable AI engineering hinges on observable, verifiable systems. If you’re tired of treating your RAG pipeline as a black box, explore these tools.
- MemVault source code: (link omitted in original)
- ContextDiff API & repository: (search for “ContextDiff”)