Zero-Downtime Embedding Migration: Switching from text-embedding-004 to text-embedding-3-large in Production
Source: Dev.to
The Situation
- Service: RAG retrieval service using pgvector on PostgreSQL
- Old model: text-embedding-004 (deprecated)
- New model: text-embedding-3-large, truncated to 768 dimensions via the `dimensions` parameter
- Data volume: thousands of embedded documents
- Constraint: zero downtime, zero data loss, production traffic must keep flowing
Step 1: Make the Model Configurable
Before anything else, stop hard‑coding the model name:
```python
import os

import openai

# Before (hardcoded in 6 places)
response = openai.embeddings.create(
    model="text-embedding-004",
    input=text,
)

# After (configured once)
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-large")
EMBED_DIMENSIONS = int(os.getenv("EMBED_DIMENSIONS", "768"))

response = openai.embeddings.create(
    model=EMBED_MODEL,
    input=text,
    dimensions=EMBED_DIMENSIONS,
)
```
Two environment variables make the difference between a 2‑day migration and a 2‑week one.
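Because text-embedding-3-large natively returns 3072 dimensions and relies on the `dimensions` parameter to truncate, a cheap guard at write time catches config drift before a wrong-sized vector ever reaches the pgvector column. A minimal sketch (`check_dimensions` is a hypothetical helper, not part of the original service):

```python
def check_dimensions(vector, expected):
    """Fail fast if the API returned a vector that won't fit the pgvector column."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, expected {expected}; "
            "check EMBED_MODEL / EMBED_DIMENSIONS."
        )
    return vector
```

Calling this on every `response.data[i].embedding` before writing turns a silent dimension mismatch into an immediate, debuggable error.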
Step 2: Add New Columns (Don’t Replace)
```sql
-- Migration: add new embedding column alongside the old one
ALTER TABLE documents
    ADD COLUMN embedding_v2 vector(768);

-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_documents_embedding_v2
    ON documents USING ivfflat (embedding_v2 vector_cosine_ops)
    WITH (lists = 100);
```
Using CONCURRENTLY builds the index without locking the table, so production reads continue uninterrupted.
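The `lists = 100` above is a tuning knob, not a constant. The pgvector README suggests roughly rows/1000 lists for tables up to about a million rows, and sqrt(rows) beyond that. A hypothetical helper for picking a starting value:

```python
import math

def ivfflat_lists(row_count: int) -> int:
    """Starting point for the ivfflat `lists` parameter, per the pgvector README
    heuristic: rows/1000 up to ~1M rows, sqrt(rows) above that."""
    if row_count <= 1_000_000:
        return max(10, row_count // 1000)
    return round(math.sqrt(row_count))
```

For a table of thousands of documents, as in this migration, the heuristic lands well below 100, so the chosen value is comfortably on the safe side.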
Step 3: Batch Re‑embedding Script
```python
import asyncio

from openai import AsyncOpenAI
from tqdm import tqdm

client = AsyncOpenAI()  # EMBED_MODEL / EMBED_DIMENSIONS come from the Step 1 config

async def re_embed_batch(session, documents, batch_size=50):
    """Re-embed documents in batches with progress tracking."""
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i + batch_size]
        texts = [doc.content for doc in batch]
        # One embedding call per batch, not per document
        response = await client.embeddings.create(
            model=EMBED_MODEL,
            input=texts,
            dimensions=EMBED_DIMENSIONS,
        )
        for doc, embedding in zip(batch, response.data):
            doc.embedding_v2 = embedding.embedding
        # Commit per batch so one failure doesn't roll back the whole run
        await session.commit()
        # Rate limiting
        await asyncio.sleep(0.5)
```
Key features:
- Batch processing – don’t embed one doc at a time.
- Progress bar – you need to know how long this takes.
- Rate limiting – embedding APIs have limits.
- Commits per batch – don’t hold a transaction for 10 K docs.
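The fixed `sleep(0.5)` keeps steady-state traffic under the limit, but a transient 429 still kills the run. One way to harden the script is a retry wrapper with exponential backoff and jitter — a sketch, with the caveat that a real version should catch `openai.RateLimitError` rather than bare `Exception`:

```python
import asyncio
import random

async def with_backoff(make_call, max_retries=5, base_delay=1.0):
    """Retry an async call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception:  # in the real script: openai.RateLimitError
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt + random.random()))
```

Usage inside the loop would look like `response = await with_backoff(lambda: client.embeddings.create(...))`, leaving the batch logic untouched.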
Step 4: Dry‑Run Validation
Before switching production traffic:
```python
async def validate_migration(session, sample_size=100):
    """Compare search results between old and new embeddings."""
    test_queries = get_random_queries(sample_size)
    overlaps = []
    for query in test_queries:
        old_results = await search(session, query, column="embedding")
        new_results = await search(session, query, column="embedding_v2")
        # Check overlap of the top-10 result sets
        old_ids = {r.id for r in old_results[:10]}
        new_ids = {r.id for r in new_results[:10]}
        overlap = len(old_ids & new_ids) / len(old_ids)
        overlaps.append(overlap)
        if overlap < 0.5:  # threshold is a judgment call; tune it
            print(f"Low overlap ({overlap:.0%}) for query: {query!r}")
    return sum(overlaps) / len(overlaps)
```

Don't expect identical rankings — the two models embed text differently. What you're looking for is queries whose result sets diverge badly enough to need manual review.
Step 5: Feature‑Flag the Search Path
The search function takes the embedding column as a parameter, so old and new vectors flow through the same code path:
```python
from sqlalchemy import text

async def search(session, query, column="embedding", top_k=10):
    """Vector search against the given embedding column."""
    query_embedding = await embed(query)  # helper that calls the embeddings API
    results = await session.execute(text(f"""
        SELECT id, content,
               1 - ({column} <=> :query_vec) AS similarity
        FROM documents
        ORDER BY {column} <=> :query_vec
        LIMIT :top_k
    """), {"query_vec": str(query_embedding), "top_k": top_k})
    return results.fetchall()
```
Deploy with USE_V2_EMBEDDINGS=false. Verify everything works. Flip to true. If anything breaks, flip back instantly.
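The flag itself can be a single lookup. A sketch, assuming the helper name and the `USE_V2_EMBEDDINGS` variable above (the helper is hypothetical, not from the original service):

```python
import os

def active_embedding_column() -> str:
    """Resolve which pgvector column the search path should read."""
    use_v2 = os.getenv("USE_V2_EMBEDDINGS", "false").lower() == "true"
    return "embedding_v2" if use_v2 else "embedding"
```

Resolving the flag per request, rather than caching it at import time, keeps the rollback a pure configuration change.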
Step 6: Cleanup
After running with v2 for a week with no issues:
```sql
-- Drop the old index before the column: dropping the column removes its
-- index automatically, after which an explicit DROP INDEX would error out
DROP INDEX idx_documents_embedding;
ALTER TABLE documents DROP COLUMN embedding;
ALTER TABLE documents RENAME COLUMN embedding_v2 TO embedding;
ALTER INDEX idx_documents_embedding_v2 RENAME TO idx_documents_embedding;
```
Lessons Learned
- Always abstract the embedding provider. Two env vars saved us from a multi‑file refactor.
- Add model‑version tracking to stored vectors. We didn’t; we should have.
- Build migration tooling before you need it. The batch script and validation tool are reusable.
- Side‑by‑side columns > in‑place replacement. The rollback story is instant.
- Dry‑run everything. Our validation caught three queries with low overlap that needed investigation.
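On the second lesson: even one extra column per vector makes staleness detectable instead of invisible. A hypothetical sketch of what that tracking could look like:

```python
from dataclasses import dataclass

@dataclass
class StoredEmbedding:
    vector: list
    model: str        # e.g. "text-embedding-3-large"
    dimensions: int

def is_stale(row: StoredEmbedding, current_model: str, current_dims: int) -> bool:
    """A vector embedded with a different model or size needs re-embedding."""
    return row.model != current_model or row.dimensions != current_dims
```

With this in place, the next migration's batch script can simply select `WHERE model != :current` instead of re-embedding everything blind.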
Total impact: 48 hours, zero downtime, zero data loss.
Read the full migration story on my blog. Part of my “Production GCP Patterns” series — find me at humzakt.github.io.