Build a Serverless RAG Engine for $0
Source: Dev.to
Introduction: The Problem with “Toy” RAG Apps
Most RAG tutorials skip the hard parts that actually matter in production:
- No security model: Users can access each other’s private data.
- Naive file handling: Large uploads crash your Node.js server.
- Expensive infra: AWS egress fees and managed vector DBs drain your wallet.
- Blocking operations: Processing files freezes your entire API.
We are going to solve all of these using a production‑proven architecture.
The $0 Tech Stack
Every piece of this stack has a generous free tier:
- Cloudflare R2: S3‑compatible storage with zero egress fees.
- Gemini 2.5 Flash: High‑performance LLM with a free tier of 15 requests/minute.
- PostgreSQL + pgvector: Battle‑tested database with native vector support.
- BullMQ: Redis‑backed job queue to handle heavy processing in the background.
Step 1: Understanding the Architecture
We follow a 4‑phase workflow designed for scale:
- Direct‑to‑Cloud Uploads – Browser uploads files directly to R2 using presigned URLs. Your server never touches the raw bytes, preventing memory crashes.
- Asynchronous Ingestion – A BullMQ worker handles the “heavy lifting”—downloading, chunking, and embedding—without blocking your API.
- Hybrid Retrieval – PostgreSQL row‑level security ensures users only search their own data.
- Contextual Generation – Gemini generates answers with smart citations (temporary links to the source files).
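The "heavy lifting" in phase 2 starts with chunking the downloaded text. A minimal fixed-size chunker with overlap, as an illustrative sketch only (the article's worker may chunk differently, e.g. by sentence or token count):

```javascript
// Split text into overlapping chunks so a fact straddling a boundary
// still appears whole in at least one chunk.
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk is then embedded and written to the `Document` table by the BullMQ worker, keeping the API thread free.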
Step 2: Zero‑Cost Storage with Cloudflare R2
Traditional uploads stream data through your server. If 10 users upload 50 MB files simultaneously, your server's memory spikes by 500 MB and the process likely crashes.
The Reservation Pattern
We issue a time‑limited presigned URL. The browser sends the file directly to Cloudflare.
```javascript
// Backend: generate the time-limited upload permission
const { signedUrl, fileKey, fileId } = await uploadService.generateSignedUrl(
  fileName,
  fileType,
  fileSize,
  isPublic,
  req.user
);
res.send({ signedUrl, fileKey, fileId });
```
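On the frontend, the browser then `PUT`s the file straight to the signed URL. A sketch assuming a standard `fetch`-based upload (`buildPutRequest` and `uploadFile` are illustrative names, not part of the article's codebase):

```javascript
// Browser side of the reservation pattern: send the file directly to R2
// via the presigned URL, so the API server never touches the raw bytes.
function buildPutRequest(file) {
  return {
    method: 'PUT',
    headers: { 'Content-Type': file.type }, // must match the signed content type
    body: file,
  };
}

async function uploadFile(signedUrl, file) {
  const res = await fetch(signedUrl, buildPutRequest(file));
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
}
```

Once the upload succeeds, the client notifies your API with the `fileId` so the ingestion job can be queued.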
Step 3: Contextual Query Rewriting
If a user asks “Who is the CEO of Tesla?” followed by “What about SpaceX?”, a naive vector search for “What about SpaceX?” will fail because it lacks context.
We use Gemma 3‑12B to rewrite queries in ~200 ms:
```javascript
// User: "What about SpaceX?"
// Gemma rewrites: "Who is the CEO of SpaceX?"
```
This ensures your vector search actually finds the right documents.
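A sketch of how the rewrite prompt might be assembled before calling the model (`buildRewritePrompt` is an illustrative helper, not the article's exact code):

```javascript
// Fold prior conversation turns into the prompt so the rewriter model can
// resolve references like "What about SpaceX?" into a standalone question.
function buildRewritePrompt(history, question) {
  const context = history
    .map((turn) => `${turn.role}: ${turn.text}`)
    .join('\n');
  return (
    'Rewrite the final question so it is fully self-contained.\n' +
    `Conversation so far:\n${context}\n` +
    `Final question: ${question}\nRewritten question:`
  );
}
```

The rewritten question, not the raw user input, is what gets embedded for the vector search.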
Step 4: Hybrid Search with Row‑Level Security
Multi‑tenancy is the biggest hurdle in RAG: User A must never see User B's documents. Instead of filtering results in application code (slow and error‑prone), we enforce row‑level access control directly in the SQL query:
```sql
SELECT d.content,
       d.metadata,
       f."originalName",
       (d.embedding <=> ${vectorQuery}::vector) AS distance
FROM "Document" d
LEFT JOIN "File" f ON d."fileId" = f.id
WHERE (d."userId" = ${userId} OR f."isPublic" = true)
ORDER BY distance ASC
LIMIT 5;
```
This enforces security at the database layer—no accidental data leaks.
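Executed from Node, the query might look like the following sketch, assuming node-postgres (`pg`) and pgvector's `<=>` cosine-distance operator (`toVectorLiteral` and `searchDocuments` are illustrative names):

```javascript
// pgvector accepts a vector literal in '[0.1,0.2,...]' text form.
function toVectorLiteral(embedding) {
  return `[${embedding.join(',')}]`;
}

// Parameterized query: tenancy enforcement stays inside the database,
// so no application-layer filter can be forgotten.
async function searchDocuments(pool, userId, embedding) {
  const { rows } = await pool.query(
    `SELECT d.content, d.metadata, f."originalName",
            (d.embedding <=> $1::vector) AS distance
       FROM "Document" d
       LEFT JOIN "File" f ON d."fileId" = f.id
      WHERE (d."userId" = $2 OR f."isPublic" = true)
      ORDER BY distance ASC
      LIMIT 5`,
    [toVectorLiteral(embedding), userId]
  );
  return rows;
}
```

Using bind parameters (`$1`, `$2`) rather than string interpolation also rules out SQL injection.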
Step 5: Visual RAG – Understanding Images
Traditional RAG is text‑only. If you upload a receipt, most systems fail. We use Gemini Vision to describe the image in detail, then embed that description.
Input (image): Photo of coffee receipt
Gemini Vision Output:
"Starbucks receipt, Jan 15, 2026. Grande Latte $5.45. Paid with Visa..."
Now, when you search “How much did I spend at Starbucks?”, the system finds the image because of its semantic description.
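A sketch of the image-description step, assuming a model instance obtained from the `@google/generative-ai` SDK's `getGenerativeModel` (`describeImage` and the prompt wording are illustrative):

```javascript
// Gemini accepts inline image data as base64; wrap the raw bytes accordingly.
function imagePart(buffer, mimeType) {
  return { inlineData: { data: buffer.toString('base64'), mimeType } };
}

// Ask the vision model for a dense, searchable description, then index the
// returned text exactly like any other document chunk.
async function describeImage(model, buffer, mimeType) {
  const result = await model.generateContent([
    'Describe this image in exhaustive detail for semantic search ' +
      '(vendor names, dates, amounts, any visible text).',
    imagePart(buffer, mimeType),
  ]);
  return result.response.text();
}
```

The description is embedded and stored like any text chunk, which is what makes the receipt findable by a natural-language query.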
Conclusion: Build vs Buy
Commercial RAG solutions can cost $1,900+/year. By building this architecture, you save that money while gaining skills in:
- Distributed systems (BullMQ)
- Vector database optimization (pgvector)
- Cloud security (presigned URLs)
🚀 Want the Full Source Code?
If you want to save 40+ hours of setup, the Node.js Enterprise Launchpad packages this entire production‑ready architecture (RAG pipeline, Auth, RBAC, Socket.io, Docker configurations).
- Standard Price: $20
- Launch Special: $4 (80% OFF)