Running a RAG Pipeline in a Production Full-Stack Application (Without a Vector Database)
Source: Dev.to
Note: If you haven’t read the first article, you can check it out here: [Insert link to previous post]
Overview
In this post we take the RAG pipeline from the backend‑only version and wrap it in a complete front‑end application. The result is a full‑stack app that lets you:
- Sign up / log in (via AWS Cognito).
- Upload PDF documents.
- Wait for the documents to be indexed (asynchronously).
- Ask questions that are answered solely from the content of the uploaded PDFs.
The goal is to demonstrate how the low‑budget RAG pipeline behaves with real users and traffic, without any hidden latency tricks or massive‑scale assumptions. It’s ideal for early‑stage experimentation, internal tools, or MVPs where feature validation matters more than peak performance.
The source code lives in the same GitHub repository as the previous post, on the full-stack-implementation branch.
Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js (App Router) | React framework with server‑side rendering & routing |
| Tailwind CSS 4 | Utility‑first styling | |
| shadcn/ui | Component library built on Radix UI primitives (New York style) | |
| NextAuth.js | Authentication integration for Next.js | |
| Lucide React | Icon set | |
| Backend | AWS Lambda | Serverless API layer |
| AWS Cognito | User pool & authentication | |
| Amazon S3 (presigned URLs) | PDF storage | |
| Amazon DynamoDB | Embeddings & metadata store | |
| Amazon Bedrock | LLM & embedding models | |
| Amazon EventBridge + SQS | Decoupled, asynchronous document indexing |
User Flow
-
Authentication – The landing page shows a Sign‑up / Login form.
Powered by the Authentication stack (Cognito + Login / Register Lambdas). -
Upload PDFs – After signing in, users can drag‑and‑drop PDFs.
Files are uploaded to S3 via presigned URLs; an EventBridge rule pushes a message to an SQS queue, which triggers the indexing Lambda. -
Indexing – The Lambda extracts text, creates embeddings (via Bedrock), and stores the vectors & metadata in DynamoDB.
This step is fully asynchronous – users are not blocked. -
Chat UI – Once indexing finishes, the documents appear in the chat interface.
Select one or more documents, type a question, and the Ask‑Questions Lambda retrieves the relevant chunks, runs a Bedrock LLM inference, and returns the answer.
Demo GIF
Insert GIF showing PDF upload flow here
Architecture Diagram
Insert diagram of the Lambda indexing flow (S3 → EventBridge → SQS → Lambda → DynamoDB) here
Performance Highlights
Document Indexing
| PDF Size | # of PDFs | Avg. Lambda Duration |
|---|---|---|
| ~400 KB | 1 | ~8 s |
| ~200 KB each | 2 | ~8 s total |
Indexing remains comfortably asynchronous; users are never blocked.
I have indexed 20+ documents of similar size and observed roughly 4 s per document for the indexing Lambda (including DynamoDB write).
Get‑Documents API
Testing with Postman
- First request: slower due to cold start.
- Subsequent requests: warm start, consistently fast.
Insert screenshot of Postman response & Lambda duration here
Cold starts are infrequent in a multi‑user environment, so most users will see rapid responses.
Ask‑Questions Lambda
| Scenario | Input | Avg. Duration (cold) | Avg. Duration (warm) |
|---|---|---|---|
| 1 document | “What is this document about?” | ~1.2 s | ~0.6 s |
| 4 documents | Same question, multiple docs | ~2.0 s | ~1.1 s |
Insert graph of Lambda execution times (cold vs. warm) here
The Lambda simply receives a payload containing the selected document IDs and the user’s question, fetches the relevant embeddings from DynamoDB, runs a Bedrock LLM inference, and returns the answer.
Cost & Operational Benefits
- No long‑running services → Zero fixed monthly fees.
- Serverless only (Lambda, S3, DynamoDB, Bedrock) → Pay‑as‑you‑go.
- Separate stacks for authentication and document processing → Independent scaling & easy API‑gateway protection (import Cognito User Pool ID).
Every architectural decision is aimed at minimizing cost and operational overhead while still delivering a functional RAG experience.
Next Steps
- Add caching (e.g., DynamoDB TTL or CloudFront) to further reduce cold‑start impact.
- Implement pagination for large document sets.
- Explore fine‑tuning Bedrock models for domain‑specific accuracy.
Feel free to clone the repo, switch to the full-stack-implementation branch, and experiment with your own PDFs!
Happy building!
Summary of the Four Documents
When inquiring about one document, the cold‑start duration is around 3 seconds, while a warm Lambda executes in under 1.5 seconds.
For four documents, the cold‑start duration is just above 4 seconds, and the warm‑start duration is just under 4 seconds.
Key takeaway: An average user won’t experience many Lambda cold starts. For a budget RAG system that uses very few AWS services (to keep complexity minimal), these performances are acceptable.
This application confirms what the first blog post theorized — a DynamoDB‑based RAG system is not just cheap, but usable, and if you understand its limits, it has acceptable performance.
Trade‑offs
- Lambda cold starts are real.
- Document query latency behaves as expected and increases as the document count grows.
For early‑stage applications, internal tools, experiments, or even a hackathon competition where you don’t want to break the bank but still need a working RAG system, this setup does its job well. It lets you ship the product, gather user feedback, and delay expensive architectural decisions until the data forces your hand.