How I Built a 37 Million Row Search Engine That Returns Results in 200ms
Source: Dev.to

I got tired of paying $99 / month for B2B lead databases, so I built my own. What started as a side project turned into a production system handling 37 million records with sub‑200 ms query times. Below is how I did it, what broke along the way, and what I’d do differently.
The Stack
- ClickHouse – columnar database, the core of everything
- FastAPI – Python backend, 6 uvicorn workers
- Next.js – frontend
- Supabase – auth and user credits
- AWS EC2 m7i‑flex.large – 2 vCPU, 8 GB RAM
Why ClickHouse Over Postgres
The dataset: 37 million rows, ~15 columns (name, email, job title, company, location, seniority, etc.).
Typical query:
```sql
SELECT first_name,
       last_name,
       email_domain,
       job_title,
       company,
       location_city,
       location_country,
       seniority
FROM leads_clean
WHERE location_country = 'United States'
  AND seniority = 'manager'
LIMIT 100;
```
- Postgres (with proper indexes): 8–15 seconds.
- ClickHouse: 80–200 ms.
ClickHouse reads only the columns referenced in the query, making analytics‑style filters across many columns dramatically faster.
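To make the columnar advantage concrete, here is a toy, framework-free Python sketch (the data and helper names are invented for illustration): a row store must touch every field of every row, while a column store reads only the arrays the filter actually references.

```python
# Row store: every row is a dict, so any filter touches all ~15 fields per row.
rows = [
    {"first_name": "Ada", "location_country": "United States", "seniority": "manager"},
    {"first_name": "Bo", "location_country": "France", "seniority": "director"},
]

# Column store: each column is its own array; a filter on two columns
# reads only those two arrays and never touches the rest.
columns = {
    "first_name": ["Ada", "Bo"],
    "location_country": ["United States", "France"],
    "seniority": ["manager", "director"],
}

def filter_columnar(cols, country, seniority):
    """Return first names whose rows match both filter columns."""
    matches = [
        i for i, (c, s) in enumerate(zip(cols["location_country"], cols["seniority"]))
        if c == country and s == seniority
    ]
    return [cols["first_name"][i] for i in matches]

print(filter_columnar(columns, "United States", "manager"))  # ['Ada']
```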
The Biggest Mistake: Frontend Reveals at Scale
Bad approach (frontend loop):
```javascript
// ❌ Don't do this for bulk operations
const results = [];
for (const lead of leads) {
  // One network round trip per lead, executed sequentially
  const res = await fetch(`/api/reveal/${lead.domain}`);
  results.push(await res.json());
}
```
This works for ~10 leads, but at around 60 the requests start timing out, promises hang, and UI state becomes inconsistent.
Fix
Move the bulk operation server‑side and stream a single CSV file.
Result: 200 leads exported as a full CSV in 0.2 seconds. The browser never times out because it receives one streaming response.
Lesson: Never perform bulk data operations in frontend loops. Batch them server‑side.
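A minimal sketch of the server-side approach, assuming a `leads` iterable of dicts (the field names are invented). The generator yields the CSV row by row, so the full file never sits in memory; in FastAPI it would be wrapped in `StreamingResponse(..., media_type="text/csv")`.

```python
import csv
import io

def stream_leads_csv(leads):
    """Yield a CSV export chunk by chunk instead of building it in memory.

    In the real app this generator would be handed to FastAPI's
    StreamingResponse; here it is a framework-free sketch.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)

    def flush():
        # Hand back whatever was written since the last flush, then reset.
        data = buf.getvalue()
        buf.seek(0)
        buf.truncate(0)
        return data

    writer.writerow(["domain", "email"])  # header row
    yield flush()
    for lead in leads:
        writer.writerow([lead["domain"], lead["email"]])
        yield flush()  # one chunk per row; memory usage stays flat

csv_text = "".join(stream_leads_csv([{"domain": "example.com", "email": "a@example.com"}]))
```

The browser then downloads one response instead of firing one request per lead.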
Startup Cache Warming
Some domains (e.g., @sncf.fr) have thousands of records. The first query for such a domain was slow because ClickHouse had to scan many granules.
Solution: Warm a cache of the top 500 domains at startup in a background thread. After a ~2‑minute warm‑up, those domains return in under 5 ms from cache.
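A simplified sketch of that warm-up, assuming a `query_domain` helper standing in for the real ClickHouse lookup: a daemon thread fills a dict-based cache in the background so startup is never blocked.

```python
import threading

email_cache = {}  # domain -> cached query result
cache_lock = threading.Lock()

def query_domain(domain):
    # Placeholder for the real ClickHouse lookup (hypothetical).
    return f"rows for {domain}"

def warm_cache(top_domains):
    """Pre-query popular domains so their first user hit is a cache read."""
    for domain in top_domains:
        result = query_domain(domain)
        with cache_lock:
            email_cache[domain] = result

def start_cache_warming(top_domains):
    # Daemon thread: warming runs in the background and never blocks startup.
    t = threading.Thread(target=warm_cache, args=(top_domains,), daemon=True)
    t.start()
    return t

t = start_cache_warming(["sncf.fr", "example.com"])
t.join()
print(len(email_cache))  # 2
```

In production the list of top domains would come from a one-off frequency query, and the cache would serve lookups while warming is still in progress.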
The Query That Killed Performance
A LIKE pattern with a leading wildcard cannot use an index, so every such query forces a full scan of the column.
```sql
-- SLOW – cannot use an index
WHERE email LIKE '%@sncf.fr'
```
A faster, vectorized alternative:
```sql
-- FASTER – vectorized ClickHouse string function
WHERE positionCaseInsensitive(email, '@sncf.fr') > 0
```
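One caveat worth knowing before swapping these predicates: they are close but not identical. `LIKE '%@sncf.fr'` is a case-sensitive suffix match, while `positionCaseInsensitive` matches the needle anywhere in the string and ignores case. A tiny pure-Python sketch of the two semantics (helper names are mine):

```python
def like_suffix(email, suffix):
    # Semantics of LIKE '%@sncf.fr': case-sensitive suffix match.
    return email.endswith(suffix)

def position_ci(email, needle):
    # Semantics of positionCaseInsensitive(email, needle) > 0:
    # case-insensitive substring match anywhere in the string.
    return needle.lower() in email.lower()

print(like_suffix("jean@sncf.fr", "@sncf.fr"))           # True
print(position_ci("JEAN@SNCF.FR", "@sncf.fr"))           # True (LIKE would miss this)
print(position_ci("x@sncf.fr.example.com", "@sncf.fr"))  # True: substring, not suffix
```

For email domains the broader match is usually acceptable, but it is a behavior change, not just an optimization.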
What I’d Do Differently
- Design for server‑side bulk operations from day one.
- Add query timeouts to every ClickHouse call, e.g., `SETTINGS max_execution_time = 5`.
- Use `FINAL` sparingly; `ReplacingMergeTree` requires it for deduplication, which adds overhead on large tables.
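As a sketch of the timeout advice above (the wrapper name and the assumption that the statement has no existing SETTINGS clause are mine), one way to enforce it is to append the clause to every statement before it is sent to ClickHouse:

```python
def with_timeout(sql, seconds=5):
    """Append a SETTINGS clause so a runaway query is killed server-side.

    A sketch: assumes the statement does not already carry a SETTINGS clause.
    """
    return f"{sql.rstrip().rstrip(';')} SETTINGS max_execution_time = {seconds};"

print(with_timeout("SELECT count() FROM leads_clean"))
# SELECT count() FROM leads_clean SETTINGS max_execution_time = 5;
```

Routing every query through one helper like this means no code path can forget the timeout.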
Live Project
The project is live at leadvault.to – 37 M B2B contacts searchable by country, job title, seniority, and company. Users receive 100 free credits daily.