Building RAG on the Edge: Cloudflare Workers, Vectorize, and FAISS — What Actually Works

Published: January 6, 2026 at 09:20 PM EST
3 min read
Source: Dev.to

Karol

Introduction

I built a Retrieval‑Augmented Generation (RAG) system using Cloudflare Workers, their new Vectorize offering, and FAISS. Here’s what I learned:

  • Edge computing is sexy, but it’s not a silver bullet.
  • The stack works, but you’ll hit friction points that traditional approaches avoid.

This isn’t a tutorial—it’s a post‑mortem of real trade‑offs.

Section 1: The Architecture That Looked Good on Paper

Why I Picked This Stack

  • Cloudflare Workers – promised serverless inference without cold starts.
  • Vectorize – managed vector storage at the edge.
  • FAISS – blazing‑fast local similarity search.

On paper: zero latency, zero ops overhead, cost efficiency. In practice, it was messier.

The Setup

  1. Store embeddings in Vectorize (Cloudflare’s managed vector database).
  2. Deploy a Worker that chunks documents and generates embeddings using a local LLM.
  3. Use FAISS as a fallback for local‑only inference during development.
# Install dependencies
npm install @cloudflare/workers-types wrangler faiss-node
pip install faiss-cpu langchain sentence-transformers

# Configure wrangler.toml
[env.production]
vars = { VECTORIZE_INDEX = "rag-index-prod" }

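To make steps 2 and 3 concrete, here is a minimal local sketch of the ingestion path. The splitter import path depends on your langchain version, and the model name, chunk sizes, and 384‑dimension index are illustrative assumptions rather than exact production values.

# Local dev pipeline: chunk documents, embed them, build the FAISS fallback index
import faiss
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def build_local_index(documents: list[str]):
    chunks = [c for doc in documents for c in splitter.split_text(doc)]
    vectors = model.encode(chunks)               # (n_chunks, 384) float32 array
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search, no training step
    index.add(vectors)
    return index, chunks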
The architecture looked solid, but the execution revealed several hard problems.

Section 2: Where Cloudflare Workers + Vectorize Actually Break

Problem 1 – Worker Execution Timeout vs. Embedding Generation

Cloudflare Workers have a 30‑second CPU timeout in production. Generating embeddings for documents longer than ~2,000 tokens consistently exceeds this limit.¹

Work‑around: Offload heavy lifting to a background job or Durable Objects, which defeats the “serverless simplicity” pitch.

# This works locally. This fails at the edge.
from sentence_transformers import SentenceTransformer

def embed_document(text: str):
    model = SentenceTransformer('all-MiniLM-L6-v2')  # reloads the model on every call
    embeddings = model.encode(text, show_progress_bar=True)  # 5‑15 s per doc
    return embeddings
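
The workaround that actually held up was keeping embedding out of the request path entirely: precompute vectors in a batch job and let the Worker do retrieval only. A rough sketch is below; the {id, values, metadata} record shape for a Vectorize bulk insert is my reading of the docs, so verify it before relying on it, and the file and model names are placeholders.

# Batch job run outside the Worker: embed chunks offline and write NDJSON
# (record shape {id, values, metadata} is an assumption -- check the Vectorize docs)
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def export_embeddings(chunks: list[str], path: str = "vectors.ndjson") -> None:
    vectors = model.encode(chunks)
    with open(path, "w") as f:
        for i, (chunk, vec) in enumerate(zip(chunks, vectors)):
            record = {"id": f"chunk-{i}", "values": vec.tolist(), "metadata": {"text": chunk}}
            f.write(json.dumps(record) + "\n")

With the vectors loaded ahead of time, the Worker only has to embed a short user query, which fits comfortably inside the CPU limit.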

Problem 2 – Vectorize API Latency Isn’t What They Advertise

Vectorize queries show 200‑400 ms response times even for simple similarity searches. The marketing says “edge speed,” but you’re still hitting a database round‑trip. A local FAISS index completes the same query in roughly a millisecond.
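
For comparison, this is the kind of micro‑benchmark I mean: an in‑process FAISS query timed with perf_counter. The 10,000 × 384 index filled with random vectors is an arbitrary assumption for illustration.

# In-process similarity search: no network round-trip, just a timed FAISS query
import time
import numpy as np
import faiss

dim, n = 384, 10_000                                  # illustrative index size
index = faiss.IndexFlatL2(dim)
index.add(np.random.rand(n, dim).astype("float32"))

query = np.random.rand(1, dim).astype("float32")
start = time.perf_counter()
distances, ids = index.search(query, 5)               # top-5 nearest neighbours
print(f"search took {(time.perf_counter() - start) * 1000:.2f} ms")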

Cloudflare Workers + Vectorize falls short when:

  • Your RAG pipeline needs sub‑200 ms retrieval, reranking, or heavy inference.
  • You want a clean abstraction – in practice you still end up managing Durable Objects, KV fallbacks, and external services.

Local RAG wins because:

  • No network overhead – all operations run in‑process.
  • State management is trivial – keep loaded models in memory.
  • Inference quality is higher – run smaller, faster models locally without timeout pressure.
  • Cost is predictable – no per‑request fees on retrieval.
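
In practice “state management is trivial” just means module‑level globals: load the model and index once, then every query reuses them. A minimal sketch, with placeholder file names:

# Model and index stay resident in memory; each query is a cheap in-process call
import faiss
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")   # loaded once at startup
_index = faiss.read_index("rag.index")             # placeholder path to a saved index
_chunks = open("chunks.txt").read().splitlines()   # text aligned with index rows

def retrieve(query: str, k: int = 5) -> list[str]:
    vec = _model.encode([query])                   # (1, 384) query embedding
    _, ids = _index.search(vec, k)
    return [_chunks[i] for i in ids[0]]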

Where edge computing actually makes sense: lightweight inference (classification, routing), not RAG.

The lesson: Don’t use Cloudflare Workers just because they’re trendy. Use them only when your problem fits their constraints. RAG doesn’t. Deploy locally or to a traditional server with a GPU, and you’ll ship faster and save money.

Footnotes

  1. Cloudflare Workers enforce a 30‑second CPU execution limit in production environments.
