How I Built a 37 Million Row Search Engine That Returns Results in 200ms

Published: March 9, 2026, 6:45 PM EDT
3 min read
Source: Dev.to


I got tired of paying $99 / month for B2B lead databases, so I built my own. What started as a side project turned into a production system handling 37 million records with sub‑200 ms query times. Below is how I did it, what broke along the way, and what I’d do differently.

The Stack

  • ClickHouse – columnar database, the core of everything
  • FastAPI – Python backend, 6 uvicorn workers
  • Next.js – frontend
  • Supabase – auth and user credits
  • AWS EC2 m7i‑flex.large – 2 vCPU, 8 GB RAM

Why ClickHouse Over Postgres

The dataset: 37 million rows, ~15 columns (name, email, job title, company, location, seniority, etc.).

Typical query:

SELECT first_name,
       last_name,
       email_domain,
       job_title,
       company,
       location_city,
       location_country,
       seniority
FROM leads_clean
WHERE location_country = 'United States'
  AND seniority = 'manager'
LIMIT 100;
  • Postgres (with proper indexes): 8–15 seconds.
  • ClickHouse: 80–200 ms.

ClickHouse reads only the columns referenced in the query, making analytics‑style filters across many columns dramatically faster.
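Column pruning works best when the table's sort key matches the common filters. A hypothetical sketch of what a matching MergeTree table could look like (the column list and ORDER BY are assumptions, not the actual leadvault.to schema):

```sql
-- Sketch only: sort key chosen to match the country + seniority filter
CREATE TABLE leads_clean
(
    first_name       String,
    last_name        String,
    email_domain     String,
    job_title        String,
    company          String,
    location_city    String,
    location_country LowCardinality(String),
    seniority        LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (location_country, seniority, email_domain);
```

With this ordering, the `WHERE location_country = 'United States' AND seniority = 'manager'` query can skip whole granules instead of scanning all 37 M rows.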

The Biggest Mistake: Frontend Reveals at Scale

Bad approach (frontend loop)

// ❌ Don't do this for bulk operations
const results = [];
for (const lead of leads) {
  // One serial network round trip per lead
  const res = await fetch(`/api/reveal/${lead.domain}`);
  results.push(await res.json());
}

This works for ~10 leads, but at around 60 leads requests start timing out in the browser, promises hang, and UI state becomes corrupted.

Fix

Move the bulk operation server‑side and stream a single CSV file.

Result: 200 leads exported as a full CSV in 0.2 seconds. The browser never times out because it receives one streaming response.

Lesson: Never perform bulk data operations in frontend loops. Batch them server‑side.
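A minimal sketch of the server-side fix, assuming a Python backend like the FastAPI one in the stack above (the function and field names are hypothetical, not the production API): build the CSV as a generator so the client receives one streaming response.

```python
import csv
import io

def stream_leads_csv(rows, fieldnames):
    """Yield a CSV file incrementally: header first, then one chunk per row.

    In a FastAPI app these chunks would be wrapped in a StreamingResponse
    so the browser gets a single download instead of issuing one request
    per lead.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    yield buf.getvalue()
    for row in rows:
        buf.seek(0)
        buf.truncate()          # reuse the buffer for each row
        writer.writerow(row)
        yield buf.getvalue()

# Example: two revealed leads streamed as CSV
leads = [
    {"email": "a@example.com", "company": "Acme"},
    {"email": "b@example.com", "company": "Globex"},
]
csv_text = "".join(stream_leads_csv(leads, ["email", "company"]))
```

In the endpoint this would become something like `StreamingResponse(stream_leads_csv(rows, fields), media_type="text/csv")`, so memory stays flat no matter how many leads are exported.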

Startup Cache Warming

Some domains (e.g., @sncf.fr) have thousands of records. The first query for such a domain was slow because ClickHouse had to scan many granules.

Solution: Warm a cache of the top 500 domains at startup in a background thread. After a ~2‑minute warm‑up, those domains return in under 5 ms from cache.
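A minimal sketch of that warming pattern, assuming an in-process dict cache (names and the query callable are illustrative, not the production code):

```python
import threading

domain_cache = {}
cache_lock = threading.Lock()

def warm_cache(top_domains, run_query):
    """Pre-run the expensive per-domain query for the hottest domains."""
    for domain in top_domains:
        rows = run_query(domain)          # e.g. a ClickHouse lookup
        with cache_lock:
            domain_cache[domain] = rows   # later requests hit the dict

def start_warming(top_domains, run_query):
    # Daemon thread: the API starts serving immediately while the
    # cache fills in the background.
    t = threading.Thread(target=warm_cache,
                         args=(top_domains, run_query), daemon=True)
    t.start()
    return t

# Example with a fake query function standing in for ClickHouse
fake_query = lambda d: [f"lead@{d}"]
t = start_warming(["sncf.fr", "example.com"], fake_query)
t.join()  # in production you would not join; shown here only to inspect the result
```

Requests that arrive before their domain is warmed simply fall through to ClickHouse, so the warm-up window degrades gracefully.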

The Query That Killed Performance

Leading‑wildcard LIKE queries cannot use indexes and force a full scan.

-- SLOW – cannot use index
WHERE email LIKE '%@sncf.fr'

A faster, vectorized alternative:

-- FASTER – uses ClickHouse function
WHERE positionCaseInsensitive(email, '@sncf.fr') > 0
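Since the query example earlier already selects an `email_domain` column, another option (assuming that column is populated at ingest and included in the sort key) is a plain equality filter, which can use the primary index directly:

```sql
-- FASTEST – equality on a materialized domain column
WHERE email_domain = 'sncf.fr'
```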

What I’d Do Differently

  • Design for server‑side bulk operations from day one.
  • Add query timeouts to every ClickHouse call, e.g., SETTINGS max_execution_time=5.
  • Use FINAL sparingly: ReplacingMergeTree deduplicates only at merge time, so exact reads need FINAL, which adds significant overhead on large tables.
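The timeout setting from the second bullet attaches directly to a query; for example (5 seconds, matching the value above):

```sql
SELECT count()
FROM leads_clean
WHERE location_country = 'United States'
SETTINGS max_execution_time = 5;  -- abort instead of hanging the worker
```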

Live Project

The project is live at leadvault.to – 37 M B2B contacts searchable by country, job title, seniority, and company. Users receive 100 free credits daily.
