Building RAG on the Edge: Cloudflare Workers, Vectorize, and FAISS — What Actually Works
Source: Dev.to

Introduction
I built a Retrieval‑Augmented Generation (RAG) system using Cloudflare Workers, their new Vectorize offering, and FAISS. Here’s what I learned:
- Edge computing is sexy, but it’s not a silver bullet.
- The stack works, but you’ll hit friction points that traditional approaches avoid.
This isn’t a tutorial—it’s a post‑mortem of real trade‑offs.
Section 1: The Architecture That Looked Good on Paper
Why I Picked This Stack
- Cloudflare Workers – promised serverless inference without cold starts.
- Vectorize – managed vector storage at the edge.
- FAISS – blazing‑fast local similarity search.
On paper: zero latency, zero ops overhead, cost efficiency. In practice, it was messier.
The Setup
- Store embeddings in Vectorize (Cloudflare’s managed vector database).
- Deploy a Worker that chunks documents and generates embeddings with a locally run embedding model.
- Use FAISS as a fallback for local‑only inference during development.
```bash
# Install dependencies
npm install @cloudflare/workers-types wrangler faiss-node
pip install faiss-cpu langchain sentence-transformers
```

```toml
# Configure wrangler.toml
[env.production]
vars = { VECTORIZE_INDEX = "rag-index-prod" }
```
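For reference, the ingestion step the Worker was meant to handle (chunking plus embedding) looks roughly like this when run locally with the packages installed above. Treat it as a sketch: the chunk sizes are illustrative, `chunk_and_embed` is a name I’m using here rather than a library API, and the text-splitter import path can differ between LangChain versions.

```python
# Sketch: the chunk-and-embed step, run locally with the packages installed above.
# Chunk size / overlap are illustrative; tune them for your documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
model = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_and_embed(text: str):
    chunks = splitter.split_text(text)   # split into overlapping character chunks
    vectors = model.encode(chunks)       # one 384-dim vector per chunk
    return list(zip(chunks, vectors))
```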
The architecture looked solid, but the execution revealed three hard problems.
Section 2: Where Cloudflare Workers + Vectorize Actually Break
Problem 1 – Worker Execution Timeout vs. Embedding Generation
Cloudflare Workers have a 30-second CPU timeout in production. Generating embeddings for documents longer than ~2,000 tokens consistently exceeds this limit.¹
Work-around: offload the heavy lifting to a background job or Durable Objects (sketched after the snippet below), which defeats the “serverless simplicity” pitch.
```python
# This works locally. This fails at the edge.
from sentence_transformers import SentenceTransformer

def embed_document(text: str):
    model = SentenceTransformer('all-MiniLM-L6-v2')           # model reloaded on every call
    embeddings = model.encode(text, show_progress_bar=True)   # 5-15 s per doc
    return embeddings
```
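One way to do that offloading, sketched below, is to run the embedding as an offline batch job and ship only precomputed vectors to the edge, so the Worker never embeds at request time. The batch size and the `embed_in_batches` helper are illustrative, not part of any library.

```python
# Work-around sketch: precompute embeddings in an offline batch job so the Worker
# only ever does retrieval. Batch size and function name are illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')   # loaded once, reused across batches

def embed_in_batches(chunks: list[str], batch_size: int = 64):
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(model.encode(chunks[i:i + batch_size]))
    return vectors   # upload these to Vectorize (or index them in FAISS) afterwards
```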
Problem 2 – Vectorize API Latency Isn’t What They Advertise
Vectorize queries show 200-400 ms response times even for simple similarity searches. The marketing says “edge speed,” but you’re still paying a database round-trip on every query. A local FAISS index answers the same search in a fraction of that time, entirely in-process.
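For comparison, the local lookup is just an in-process call. The sketch below uses random vectors as a stand-in for real embeddings, so absolute timings will vary with hardware and corpus size, but there is no network hop anywhere.

```python
# Sketch: the same top-k similarity search against a local, in-process FAISS index.
# The 384 dimensions match all-MiniLM-L6-v2; the corpus here is random stand-in data.
import time
import faiss
import numpy as np

dim = 384
corpus = np.random.rand(10_000, dim).astype('float32')   # stand-in for real embeddings
index = faiss.IndexFlatL2(dim)
index.add(corpus)

query = np.random.rand(1, dim).astype('float32')
start = time.perf_counter()
distances, ids = index.search(query, 5)                   # exact top-5 neighbours
print(f"search took {(time.perf_counter() - start) * 1000:.2f} ms")
```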
Cloudflare Workers + Vectorize fails for:
- RAG pipelines that need sub-200 ms retrieval, reranking, or heavy inference.
- Teams expecting a clean abstraction – it leaks, and you still end up managing Durable Objects, KV fallbacks, and external services.
Local RAG wins because:
| Benefit | Explanation |
|---|---|
| No network overhead | All operations run in‑process. |
| State management is trivial | Keep loaded models in memory. |
| Inference quality is higher | Run smaller, faster models locally without timeout pressure. |
| Cost is predictable | No per‑request fees on retrieval. |
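To make those rows concrete, here is the shape of a fully local retrieval pipeline: the model stays loaded, the index lives in memory, and retrieval is a function call. The toy corpus, the `retrieve` helper, and `k` are illustrative.

```python
# Minimal local retrieval pipeline: the model and index stay in memory,
# and every step runs in-process. The corpus and question are toy examples.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')    # loaded once, kept warm

docs = [
    "Workers enforce a per-request CPU time limit.",
    "FAISS does exact similarity search in memory.",
    "Vectorize is Cloudflare's managed vector index.",
]
vectors = model.encode(docs).astype('float32')

index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

def retrieve(question: str, k: int = 2):
    q = model.encode([question]).astype('float32')
    _, ids = index.search(q, k)
    return [docs[i] for i in ids[0]]               # top-k chunks to stuff into the prompt

print(retrieve("What limits does the edge runtime impose?"))
```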
Where edge computing actually makes sense: lightweight inference (classification, routing), not RAG.
The lesson: Don’t use Cloudflare Workers just because they’re trendy. Use them only when your problem fits their constraints. RAG doesn’t. Deploy locally or to a traditional server with a GPU, and you’ll ship faster and save money.
Footnotes
1. Cloudflare Workers enforce a 30-second CPU execution limit in production environments.
