Build a Serverless RAG Engine for $0
Source: Dev.to
Introduction: The Problem with “Toy” RAG Apps
Most RAG tutorials skip the hard parts that actually matter in production:
- No security model: Users can access each other’s private data.
- Naive file handling: Large uploads crash your Node.js server.
- Expensive infra: AWS egress fees and managed vector DBs drain your wallet.
- Blocking operations: Processing files freezes your entire API.
We are going to solve all of these using a production‑proven architecture.
The $0 Tech Stack
Every piece of this stack has a generous free tier:
- Cloudflare R2: S3‑compatible storage with zero egress fees.
- Gemini 2.5 Flash: High‑performance LLM with a free tier of 15 requests/minute.
- PostgreSQL + pgvector: Battle‑tested database with native vector support.
- BullMQ: Redis‑backed job queue to handle heavy processing in the background.
Step 1: Understanding the Architecture
We follow a 4‑phase workflow designed for scale:
- Direct‑to‑Cloud Uploads – Browser uploads files directly to R2 using presigned URLs. Your server never touches the raw bytes, preventing memory crashes.
- Asynchronous Ingestion – A BullMQ worker handles the “heavy lifting”—downloading, chunking, and embedding—without blocking your API.
- Hybrid Retrieval – PostgreSQL row‑level security ensures users only search their own data.
- Contextual Generation – Gemini generates answers with smart citations (temporary links to the source files).
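The "heavy lifting" in phase 2 starts with chunking the downloaded text. A minimal fixed-size chunker with overlap, as an illustrative sketch only (the article's worker may chunk differently, e.g. by sentence or token count):

```javascript
// Split text into overlapping chunks so a fact straddling a boundary
// still appears whole in at least one chunk.
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk is then embedded and written to the `Document` table by the BullMQ worker, keeping the API thread free.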
Step 2: Zero‑Cost Storage with Cloudflare R2
Traditional uploads stream data through your server. If 10 users upload 50 MB files simultaneously, your server's memory spikes by 500 MB and the process likely crashes.
The Reservation Pattern
We issue a time‑limited presigned URL. The browser sends the file directly to Cloudflare.
```javascript
// Backend: generate the time-limited upload permission
const { signedUrl, fileKey, fileId } = await uploadService.generateSignedUrl(
  fileName,
  fileType,
  fileSize,
  isPublic,
  req.user
);
res.send({ signedUrl, fileKey, fileId });
```
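On the frontend, the browser then `PUT`s the file straight to the signed URL. A sketch assuming a standard `fetch`-based upload (`buildPutRequest` and `uploadFile` are illustrative names, not part of the article's codebase):

```javascript
// Browser side of the reservation pattern: send the file directly to R2
// via the presigned URL, so the API server never touches the raw bytes.
function buildPutRequest(file) {
  return {
    method: 'PUT',
    headers: { 'Content-Type': file.type }, // must match the signed content type
    body: file,
  };
}

async function uploadFile(signedUrl, file) {
  const res = await fetch(signedUrl, buildPutRequest(file));
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
}
```

Once the upload succeeds, the client notifies your API with the `fileId` so the ingestion job can be queued.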
Step 3: Contextual Query Rewriting
If a user asks “Who is the CEO of Tesla?” followed by “What about SpaceX?”, a naive vector search for “What about SpaceX?” will fail because it lacks context.
We use Gemma 3‑12B to rewrite queries in ~200 ms:
```javascript
// User: "What about SpaceX?"
// Gemma rewrites: "Who is the CEO of SpaceX?"
```
This ensures your vector search actually finds the right documents.
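A sketch of how the rewrite prompt might be assembled before calling the model (`buildRewritePrompt` is an illustrative helper, not the article's exact code):

```javascript
// Fold prior conversation turns into the prompt so the rewriter model can
// resolve references like "What about SpaceX?" into a standalone question.
function buildRewritePrompt(history, question) {
  const context = history
    .map((turn) => `${turn.role}: ${turn.text}`)
    .join('\n');
  return (
    'Rewrite the final question so it is fully self-contained.\n' +
    `Conversation so far:\n${context}\n` +
    `Final question: ${question}\nRewritten question:`
  );
}
```

The rewritten question, not the raw user input, is what gets embedded for the vector search.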
Step 4: Hybrid Search with Row‑Level Security
Multi‑tenancy is the biggest hurdle in RAG: User A must never see User B's documents. Instead of filtering results in application code (slow and error‑prone), we enforce row‑level access control directly in the SQL query:
```sql
SELECT d.content,
       d.metadata,
       f."originalName",
       (d.embedding <=> ${vectorQuery}::vector) AS distance
FROM "Document" d
LEFT JOIN "File" f ON d."fileId" = f.id
WHERE (d."userId" = ${userId} OR f."isPublic" = true)
ORDER BY distance ASC
LIMIT 5;
```
This enforces security at the database layer—no accidental data leaks.
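Executed from Node, the query might look like the following sketch, assuming node-postgres (`pg`) and pgvector's `<=>` cosine-distance operator (`toVectorLiteral` and `searchDocuments` are illustrative names):

```javascript
// pgvector accepts a vector literal in '[0.1,0.2,...]' text form.
function toVectorLiteral(embedding) {
  return `[${embedding.join(',')}]`;
}

// Parameterized query: tenancy enforcement stays inside the database,
// so no application-layer filter can be forgotten.
async function searchDocuments(pool, userId, embedding) {
  const { rows } = await pool.query(
    `SELECT d.content, d.metadata, f."originalName",
            (d.embedding <=> $1::vector) AS distance
       FROM "Document" d
       LEFT JOIN "File" f ON d."fileId" = f.id
      WHERE (d."userId" = $2 OR f."isPublic" = true)
      ORDER BY distance ASC
      LIMIT 5`,
    [toVectorLiteral(embedding), userId]
  );
  return rows;
}
```

Using bind parameters (`$1`, `$2`) rather than string interpolation also rules out SQL injection.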
Step 5: Visual RAG – Understanding Images
Traditional RAG is text‑only. If you upload a receipt, most systems fail. We use Gemini Vision to describe the image in detail, then embed that description.
Input (image): Photo of coffee receipt
Gemini Vision Output:
"Starbucks receipt, Jan 15, 2026. Grande Latte $5.45. Paid with Visa..."
Now, when you search “How much did I spend at Starbucks?”, the system finds the image because of its semantic description.
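A sketch of the image-description step, assuming a model instance obtained from the `@google/generative-ai` SDK's `getGenerativeModel` (`describeImage` and the prompt wording are illustrative):

```javascript
// Gemini accepts inline image data as base64; wrap the raw bytes accordingly.
function imagePart(buffer, mimeType) {
  return { inlineData: { data: buffer.toString('base64'), mimeType } };
}

// Ask the vision model for a dense, searchable description, then index the
// returned text exactly like any other document chunk.
async function describeImage(model, buffer, mimeType) {
  const result = await model.generateContent([
    'Describe this image in exhaustive detail for semantic search ' +
      '(vendor names, dates, amounts, any visible text).',
    imagePart(buffer, mimeType),
  ]);
  return result.response.text();
}
```

The description is embedded and stored like any text chunk, which is what makes the receipt findable by a natural-language query.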
Conclusion: Build vs Buy
Commercial RAG solutions can cost $1,900+/year. By building this architecture, you save that money while gaining skills in:
- Distributed systems (BullMQ)
- Vector database optimization (pgvector)
- Cloud security (presigned URLs)
🚀 Want the Full Source Code?
If you want to save 40+ hours of setup, the Node.js Enterprise Launchpad packages this entire production‑ready architecture (RAG pipeline, Auth, RBAC, Socket.io, Docker configurations).
- Standard Price: $20
- Launch Special: $4 (80% OFF)