12 Things Nobody Tells You About Building a Production RAG System
Source: Dev.to
I spent several months building Ask360, a multi-tenant RAG platform that lets anyone upload documents and deploy an AI assistant with source citations and confidence badges. The UI and CRUD took a few weeks. The search quality, failover, and production hardening? That's where the real engineering happened. Here are the problems I hit that tutorials don't cover, and the solutions that actually work in production.

Every RAG tutorial shows one database. In production, I use two:

- MariaDB for user accounts, projects, documents, billing, and usage tracking. Everything the admin panel touches.
- PostgreSQL + pgvector for vector embeddings, semantic cache, and chunk storage. Everything the search engine touches.

Why not one? MariaDB is excellent for the CRUD-heavy admin layer. Yii2's ActiveRecord, migrations, fixtures, and auth all work out of the box. But MariaDB has no vector similarity search. PostgreSQL with pgvector gives me HNSW indexes for fast approximate nearest neighbor search, ts_rank_cd for BM25-style keyword scoring, and trigram indexes for fuzzy matching, all in SQL.

The two databases communicate through the documents table. PHP extracts text and stores it in MariaDB. The Node.js engine reads it, chunks it, embeds it, and stores vectors in PostgreSQL. The project config syncs to PostgreSQL via a simple UPSERT so the engine can join against it during search.

| PHP (Yii2) + MariaDB | Node.js (Express) + PostgreSQL + pgvector |
|---|---|
| user, company, project | project_ref (synced) |
| rag_source (text, status) | rag_chunk (content, embedding) |
| billing, usage, alerts | semantic_cache |
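That sync can stay tiny. Here's a sketch of the project-config UPSERT, assuming node-postgres and a JSONB config column — the `project_ref` table name is from this article, but the column list is illustrative:

```javascript
// Keep PostgreSQL's project_ref in step with MariaDB on every project save.
// Column names (project_id, config) are assumptions for illustration.
const UPSERT_PROJECT_SQL = `
  INSERT INTO project_ref (project_id, config)
  VALUES ($1, $2::jsonb)
  ON CONFLICT (project_id) DO UPDATE
     SET config = EXCLUDED.config`;

// With node-postgres:
// await pool.query(UPSERT_PROJECT_SQL, [projectId, JSON.stringify(config)]);
```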
The admin layer doesn't touch vectors. The engine doesn't touch users. Clean separation. Either can be replaced independently. The tradeoff: you need to keep project config in sync between the two databases. A 50-byte UPSERT on project save handles this. Worth it for the architectural simplicity of letting each database do what it's best at.

I started with local ONNX embeddings: first Xenova/all-MiniLM-L6-v2 (384-dim), then upgraded to Snowflake/snowflake-arctic-embed-m-v1.5 (256-dim, Matryoshka). The appeal was obvious: no API costs, no rate limits, data stays on your server.

The reality on a production server:

- Each ONNX inference ties up a CPU core for ~15ms per chunk.
- On a 6-core server, 4 embed instances process ~260 chunks/second. Sounds fast until a user uploads a 600-chunk document and the server goes to 100% CPU for 2+ minutes.
- ONNX Runtime's spin-wait burns CPU even when idle. Multiple threads per instance waste cycles on synchronization instead of computation.
- Model loading takes 3-5 seconds and ~105MB RAM per instance.
- Tuning OMP_NUM_THREADS, instance counts, and batch sizes is a combinatorial nightmare that changes with every server.

Then I tried Gemini's embedding API (gemini-embedding-001, 256-dim):
| Metric | Local ONNX | Gemini API |
|---|---|---|
| Throughput | ~260 chunks/sec (6-core) | ~1000+ chunks/sec |
| CPU load during embed | 100% | ~5% |
| Cost per 1000 chunks | $0 (but CPU cost) | ~$0.001 |
| Cold start | 3-5 sec model load | None |
| Tuning required | Extensive | None |
The API is faster, cheaper than the CPU time it saves, and needs zero tuning. I kept local ONNX as a fallback (configured per-project via EMBED_PROVIDER), but production runs on Gemini embeddings.

The lesson: "free" local inference isn't free when you account for CPU contention, tuning time, and the ops burden of managing ONNX threads across PM2 instances. API embeddings are a solved problem. Pay the fraction of a cent and focus on search quality instead.

I added cross-encoder reranking using Xenova/ms-marco-MiniLM-L-6-v2 to improve search quality. Every query returned a confidence score of exactly 0.731. Completely broken.

The trap: I used pipeline("text-classification", ...) from @huggingface/transformers. This applies softmax internally. For single-label models (num_labels=1), softmax over a single logit always produces 1.0. Always.

The fix is to use AutoModelForSequenceClassification directly and apply sigmoid yourself:

```js
const output = await model(inputs);
const logit = output.logits.data[0];
const score = 1 / (1 + Math.exp(-logit)); // sigmoid, not softmax
```
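For context, here's what the full direct-model path can look like in transformers.js — a sketch, not my exact production code, so verify the tokenizer options against the current @huggingface/transformers docs:

```javascript
// Score (query, passage) pairs with the cross-encoder directly, skipping
// pipeline()'s softmax. Sigmoid maps each single logit into (0, 1).
const sigmoid = (logit) => 1 / (1 + Math.exp(-logit));

async function rerank(query, passages) {
  // Lazy import so the model only loads when reranking is actually used.
  const { AutoTokenizer, AutoModelForSequenceClassification } =
    await import("@huggingface/transformers");
  const modelId = "Xenova/ms-marco-MiniLM-L-6-v2";
  const tokenizer = await AutoTokenizer.from_pretrained(modelId);
  const model = await AutoModelForSequenceClassification.from_pretrained(modelId);

  const inputs = tokenizer(new Array(passages.length).fill(query), {
    text_pair: passages,
    padding: true,
    truncation: true,
  });
  const { logits } = await model(inputs); // one logit per (query, passage) pair
  return passages.map((_, i) => sigmoid(logits.data[i]));
}
```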
This is documented in sentence-transformers issue #2874, but you’d never find it from the model card. If you’re building RAG with JavaScript and using cross-encoders, skip pipeline() and use the model directly. After fixing the sigmoid issue, factoid queries worked perfectly:
| Query | CE Score | Correct? |
|---|---|---|
| "How many PTO days do employees get per year?" | 0.9994 | Yes |
| "What health insurance does the company provide?" | 0.9891 | Yes |
| "What is the vacation policy?" | 0.0002 | No |
That last query retrieved the right documents (vector similarity 0.38), but the cross-encoder scored them near zero.

Why: MS MARCO's training data marks passages as relevant only if a human used them to write a specific answer. The model learned "Does this passage ANSWER this question?" rather than "Is this passage ABOUT this topic?" Topical queries like "What is the vacation policy?" have no direct Q&A match, so the cross-encoder rejects them.

The fix: multi-signal confidence. Combine two signals, where either one can promote the tier:

```
if (topCEScore >= 0.5)                          // verified
else if (topCEScore >= 0.02 || vecSim >= 0.35)  // grounded
else if (topCEScore >= 0.001 || vecSim >= 0.25) // mixed
else                                            // ungrounded
```
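As a runnable function — a direct transcription of those thresholds, with the tier names as plain strings:

```javascript
// Map cross-encoder score + vector similarity to a confidence tier.
// Either signal alone can promote the answer to a higher tier.
function confidenceTier(topCEScore, vecSim) {
  if (topCEScore >= 0.5) return "verified";
  if (topCEScore >= 0.02 || vecSim >= 0.35) return "grounded";
  if (topCEScore >= 0.001 || vecSim >= 0.25) return "mixed";
  return "ungrounded";
}
```

The vacation-policy example lands in "grounded": the cross-encoder says 0.0002, but vector similarity of 0.38 clears the 0.35 bar.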
Now "What is the vacation policy?" scores as grounded (vector catches it), while "How do quantum computers work?" correctly scores as ungrounded (both signals low). The lesson: never trust a single neural signal. Let multiple weak signals vote.

The first version of hybrid search added vector similarity and keyword scores together:

```sql
SELECT *, (cosine_score + LEAST(ts_rank_cd * 0.25, 0.20)) AS combined
FROM rag_chunk
ORDER BY combined DESC
```
This is the score-mixing anti-pattern. Vector similarity (a 0-1 range) and keyword rank scores (a completely different scale) distort each other when added, and tuning the weights is a moving target.

Reciprocal Rank Fusion (RRF) solves this elegantly:

```
rrf_score(doc) = 1/(60 + rank_vector) + 1/(60 + rank_keyword)
```
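The fusion step is small enough to sketch in full. `rrfFuse` and its inputs — two lists of chunk IDs, each already sorted by one signal — are hypothetical names:

```javascript
// Reciprocal Rank Fusion over two ranked lists of chunk IDs.
// Rank positions are 1-based; k = 60 as in the original 2009 paper.
function rrfFuse(vectorRanked, keywordRanked, k = 60) {
  const scores = new Map();
  for (const list of [vectorRanked, keywordRanked]) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

A chunk that appears in both lists accumulates two reciprocal-rank terms, so it naturally outranks a chunk that is strong in only one signal — no normalization step needed.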
Each document gets a score based purely on its rank position in each list, not its raw score. Documents appearing in both lists naturally score higher. The k=60 constant (from the original 2009 paper) dampens outliers. No weight tuning. No score normalization. Works across any two ranking signals. Total fusion time: under a millisecond.

Retries were the next hard-won lesson. Node's fetch() wraps low-level network failures, so a code like ETIMEDOUT can sit several levels deep in an error's cause chain; checking only the top-level message misses it. The retry logic walks the chain:

```js
const RETRYABLE = ["ETIMEDOUT", "ECONNRESET"]; // transient network errors

function isRetryable(err) {
  let current = err;
  while (current) {
    const e = current;
    if (RETRYABLE.some((c) => e.message?.includes(c))) return true;
    current = e.cause ?? null; // Walk up
  }
  return false;
}
```
Now the full chain works: ETIMEDOUT -> retry 3x -> still failing -> failover to Claude -> retry 3x -> success. Or, if both providers are down, the circuit breaker opens.

When your primary LLM provider (Gemini) goes down, you fail over to Claude. But you also need to stop hammering the dead provider and know when to try again. A fixed cooldown doesn't work: a 5-minute cooldown retries too aggressively during a major outage, while a 1-hour cooldown recovers too slowly from a brief hiccup.

Escalating cooldown:

```js
const COOLDOWN_SCHEDULE_MINS = [5, 15, 60, 180, 600];
```
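A minimal sketch of a breaker built around that schedule — the class and method names here are hypothetical, not my exact implementation:

```javascript
const COOLDOWN_SCHEDULE_MINS = [5, 15, 60, 180, 600];

// Per-provider circuit breaker: CLOSED -> OPEN -> HALF_OPEN, with the
// cooldown escalating each time a post-cooldown probe fails.
class Circuit {
  constructor() {
    this.state = "CLOSED";
    this.failures = 0;   // consecutive failures while CLOSED
    this.level = 0;      // index into the cooldown schedule
    this.openedAt = 0;
  }
  cooldownMs() {
    const i = Math.min(this.level, COOLDOWN_SCHEDULE_MINS.length - 1);
    return COOLDOWN_SCHEDULE_MINS[i] * 60_000;
  }
  canTry(now = Date.now()) {
    if (this.state !== "OPEN") return true;
    if (now - this.openedAt >= this.cooldownMs()) {
      this.state = "HALF_OPEN"; // allow one probe request
      return true;
    }
    return false;
  }
  onSuccess() {
    this.state = "CLOSED";
    this.failures = 0;
    this.level = 0; // full recovery resets the escalation
  }
  onFailure(now = Date.now()) {
    if (this.state === "HALF_OPEN") {
      this.level += 1; // probe failed -> escalate the cooldown
    } else if (++this.failures < 2) {
      return; // need 2 consecutive failures to open
    }
    this.state = "OPEN";
    this.openedAt = now;
  }
}
```

Because the state lives in plain fields, a restart gives you fresh CLOSED circuits for free — which is exactly the behavior described below.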
- 2 consecutive failures -> circuit opens, 5 min cooldown
- Probe fails after cooldown -> escalate to 15 min
- Still down -> 60 min, then 3 hours, then 10 hours (capped)

The state machine: CLOSED -> OPEN -> HALF_OPEN -> (success) -> CLOSED, or -> (failure) -> OPEN with an escalated cooldown.

Key decision: circuit state is in-memory only. Restarting the engine resets all circuits to CLOSED. This is intentional: if the engine restarts, the provider may have recovered. No stale state.

Our architecture has PHP (Yii2) orchestrating document processing and Node.js doing the computation. They communicate over HTTP. The problem:
| Operation | Duration | HTTP's comfort zone? |
|---|---|---|
| Embed 20 chunks | 1-3 sec | Borderline |
| Embed 200+ chunks | 30-60 sec | No |
| LLM during outage + retries | 60+ sec | Definitely no |
PHP's file_get_contents() silently returns false on timeout. No partial result. No error message. Documents get stuck in "processing" forever.

Three-layer solution:

1. Break jobs into HTTP-sized pieces. A batch of 20 chunks per call, each completing in 1-3 seconds.
2. Parallelize across the PM2 cluster. Large documents split into N parts, each acquiring a MySQL advisory lock for concurrency control.
3. IDs-only messaging. PHP sends {sourceId: 123}; the engine fetches the text from the DB, chunks, embeds, stores, and calls back. Only ~50 bytes cross the wire.

Advisory locks auto-release on process death. If an embed worker crashes, the slot is freed immediately. No cleanup needed.

Running embedding and chat in the same process is a mistake. Embedding is CPU-hungry and blocks the event loop; chat needs low-latency SSE streaming. We run two PM2 apps from the same codebase, differentiated by an ENGINE_ROLE env var:

```js
// ecosystem.config.cjs
module.exports = {
  apps: [
    { name: "rag-engine-prd-chat",  env: { ENGINE_ROLE: "chat",  PORT: 3001 }, instances: 2 },
    { name: "rag-engine-prd-embed", env: { ENGINE_ROLE: "embed", PORT: 3005 }, instances: 2 },
  ],
};
```
The chat app handles streaming, project config, reranking, and health checks. The embed app handles document embedding only. They scale independently. Instance counts auto-detect based on provider:

- API providers (Gemini/OpenAI): 1 chat, 1 embed (the API does the heavy lifting)
- Local ONNX: CPU-based (chat = max(2, CPUs × 0.25), embed = CPUs × 0.5)

When an embed process dies mid-flight (OOM kill, server restart, unhandled exception), the document stays in status = 'processing' forever. No HTTP callback comes back. No heartbeat protocol exists.

Pragmatic fix: a cron job runs every 5 minutes, checks for documents stuck in "processing" for longer than a threshold, and resets them to "pending":

```js
STALE_THRESHOLD_API   = 300   // 5 min for API providers (fast)
STALE_THRESHOLD_LOCAL = 1800  // 30 min for local ONNX (slow)
```
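The sweep itself can be a single statement. A sketch assuming mysql2 and an `updated_at` timestamp column (the column name is my assumption here; `rag_source` is the table from earlier):

```javascript
// Build the recovery statement: reset only rows still marked 'processing'
// that have been stuck past the threshold.
function buildStaleRecoveryQuery(thresholdSecs) {
  return {
    sql: `UPDATE rag_source
             SET status = 'pending'
           WHERE status = 'processing'
             AND updated_at < NOW() - INTERVAL ? SECOND`,
    params: [thresholdSecs],
  };
}

// With mysql2: const { sql, params } = buildStaleRecoveryQuery(300);
//              const [result] = await pool.execute(sql, params);
```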
The recovery uses an atomic UPDATE guarded by WHERE status = 'processing' to prevent overwriting a legitimate callback that arrives at the same moment.

This is a workaround for HTTP's lack of connection lifecycle awareness. A WebSocket would know immediately when the other end dies. But for a system where embeds happen in the background and take seconds to minutes, a periodic recovery sweep is simpler and just as effective.

Input protection catches problems before they reach the LLM; output protection catches problems in the response. Both run as ordered rule pipelines.

Inbound (before the LLM):

- Rate limiting (per-IP, per-project, per-minute)
- Content safety filter (offensive/harmful queries)
- Prompt injection detection (attempts to override the system prompt)

Outbound (after the LLM):

- PII redaction (SSNs, credit cards, phone numbers) runs in real time as tokens stream
- Hallucination flagging via confidence badges warns when document support is weak
- Response length limits prevent runaway generation

The key design: PII redaction runs on each streaming chunk in real time, so sensitive data never reaches the browser. The other checks (hallucination, length) evaluate the complete response for accuracy.

After all this tuning, here's where the system landed:
| Metric | Value |
|---|---|
| Total retrieval + rerank | 50-80ms |
| RRF fusion | <1ms |
| Cross-encoder (5 passages) | ~30ms |
| Neighbor expansion | ~2ms |
| Semantic cache hit | 1-2ms |
| Embedding (20 chunks, API) | 1-3 sec |
| Circuit breaker initial cooldown | 5 min |
| Circuit breaker max cooldown | 10 hours |
| Stale recovery check | every 5 min |
- Use API embeddings from day one. Local ONNX sounds appealing ("no API costs!") but the CPU contention, thread tuning, and cold starts aren't worth it. Gemini embeddings cost fractions of a cent and need zero ops.
- Start with RRF + multi-signal confidence from day one. The single-score approach wasted a week of debugging before I understood the MS MARCO bias.
- Split chat and embed processes immediately. Running them together caused latency spikes I spent days chasing.
- Build the circuit breaker before you need it. The first Gemini outage hit at 2am; I was manually switching providers until the circuit breaker was ready.
- Use advisory locks instead of queue tables. They auto-release on crash, which eliminates an entire class of stuck-job bugs.

Ask360 is live with a free tier: 1 project, 2 documents, 100 queries/month, forever free. Upload some docs and see the confidence badges and source citations in action. If you're building something similar, I hope this saves you some of the debugging I went through.

The RAG tutorials show you how to wire up vector search in 50 lines. The other 10,000 lines are what this post is about.

Built with PHP (Yii2), Node.js, PostgreSQL + pgvector, Gemini, Claude, and a lot of production debugging.