How I Built a 37 Million Row Search Engine That Returns Results in 200ms

Published: March 9, 2026, 6:45 PM EDT
3 min read
Source: Dev.to


I got tired of paying $99 / month for B2B lead databases, so I built my own. What started as a side project turned into a production system handling 37 million records with sub‑200 ms query times. Below is how I did it, what broke along the way, and what I’d do differently.

The Stack

  • ClickHouse – columnar database, the core of everything
  • FastAPI – Python backend, 6 uvicorn workers
  • Next.js – frontend
  • Supabase – auth and user credits
  • AWS EC2 m7i‑flex.large – 2 vCPU, 8 GB RAM

Why ClickHouse Over Postgres

The dataset: 37 million rows, ~15 columns (name, email, job title, company, location, seniority, etc.).

Typical query:

SELECT first_name,
       last_name,
       email_domain,
       job_title,
       company,
       location_city,
       location_country,
       seniority
FROM leads_clean
WHERE location_country = 'United States'
  AND seniority = 'manager'
LIMIT 100;
  • Postgres (with proper indexes): 8–15 seconds.
  • ClickHouse: 80–200 ms.

ClickHouse reads only the columns referenced in the query, making analytics‑style filters across many columns dramatically faster.
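Column pruning works best when the table's sort key matches the common filters. A hypothetical sketch of what a matching MergeTree table could look like (the column list and ORDER BY are assumptions, not the actual leadvault.to schema):

```sql
-- Sketch only: sort key chosen to match the country + seniority filter
CREATE TABLE leads_clean
(
    first_name       String,
    last_name        String,
    email_domain     String,
    job_title        String,
    company          String,
    location_city    String,
    location_country LowCardinality(String),
    seniority        LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (location_country, seniority, email_domain);
```

With this ordering, the `WHERE location_country = 'United States' AND seniority = 'manager'` query can skip whole granules instead of scanning all 37 M rows.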

The Biggest Mistake: Frontend Reveals at Scale

Bad approach (frontend loop)

// ❌ Don't do this for bulk operations
const results = [];
for (const lead of leads) {
  // One serial network round trip per lead
  const res = await fetch(`/api/reveal/${lead.domain}`);
  results.push(await res.json());
}

This works for ~10 leads, but at around 60 leads requests start timing out in the browser, promises hang, and UI state becomes corrupted.

Fix

Move the bulk operation server‑side and stream a single CSV file.

Result: 200 leads exported as a full CSV in 0.2 seconds. The browser never times out because it receives one streaming response.

Lesson: Never perform bulk data operations in frontend loops. Batch them server‑side.
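A minimal sketch of the server-side fix, assuming a Python backend like the FastAPI one in the stack above (the function and field names are hypothetical, not the production API): build the CSV as a generator so the client receives one streaming response.

```python
import csv
import io

def stream_leads_csv(rows, fieldnames):
    """Yield a CSV file incrementally: header first, then one chunk per row.

    In a FastAPI app these chunks would be wrapped in a StreamingResponse
    so the browser gets a single download instead of issuing one request
    per lead.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    yield buf.getvalue()
    for row in rows:
        buf.seek(0)
        buf.truncate()          # reuse the buffer for each row
        writer.writerow(row)
        yield buf.getvalue()

# Example: two revealed leads streamed as CSV
leads = [
    {"email": "a@example.com", "company": "Acme"},
    {"email": "b@example.com", "company": "Globex"},
]
csv_text = "".join(stream_leads_csv(leads, ["email", "company"]))
```

In the endpoint this would become something like `StreamingResponse(stream_leads_csv(rows, fields), media_type="text/csv")`, so memory stays flat no matter how many leads are exported.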

Startup Cache Warming

Some domains (e.g., @sncf.fr) have thousands of records. The first query for such a domain was slow because ClickHouse had to scan many granules.

Solution: Warm a cache of the top 500 domains at startup in a background thread. After a ~2‑minute warm‑up, those domains return in under 5 ms from cache.
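A minimal sketch of that warming pattern, assuming an in-process dict cache (names and the query callable are illustrative, not the production code):

```python
import threading

domain_cache = {}
cache_lock = threading.Lock()

def warm_cache(top_domains, run_query):
    """Pre-run the expensive per-domain query for the hottest domains."""
    for domain in top_domains:
        rows = run_query(domain)          # e.g. a ClickHouse lookup
        with cache_lock:
            domain_cache[domain] = rows   # later requests hit the dict

def start_warming(top_domains, run_query):
    # Daemon thread: the API starts serving immediately while the
    # cache fills in the background.
    t = threading.Thread(target=warm_cache,
                         args=(top_domains, run_query), daemon=True)
    t.start()
    return t

# Example with a fake query function standing in for ClickHouse
fake_query = lambda d: [f"lead@{d}"]
t = start_warming(["sncf.fr", "example.com"], fake_query)
t.join()  # in production you would not join; shown here only to inspect the result
```

Requests that arrive before their domain is warmed simply fall through to ClickHouse, so the warm-up window degrades gracefully.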

The Query That Killed Performance

Leading‑wildcard LIKE queries cannot use indexes and force a full scan.

-- SLOW – cannot use index
WHERE email LIKE '%@sncf.fr'

A faster, vectorized alternative:

-- FASTER – uses ClickHouse function
WHERE positionCaseInsensitive(email, '@sncf.fr') > 0
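Since the query example earlier already selects an `email_domain` column, another option (assuming that column is populated at ingest and included in the sort key) is a plain equality filter, which can use the primary index directly:

```sql
-- FASTEST – equality on a materialized domain column
WHERE email_domain = 'sncf.fr'
```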

What I’d Do Differently

  • Design for server‑side bulk operations from day one.
  • Add query timeouts to every ClickHouse call, e.g., SETTINGS max_execution_time=5.
  • Use FINAL sparingly: ReplacingMergeTree deduplicates only at merge time, so exact reads need FINAL, which adds significant overhead on large tables.
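The timeout setting from the second bullet attaches directly to a query; for example (5 seconds, matching the value above):

```sql
SELECT count()
FROM leads_clean
WHERE location_country = 'United States'
SETTINGS max_execution_time = 5;  -- abort instead of hanging the worker
```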

Live Project

The project is live at leadvault.to – 37 M B2B contacts searchable by country, job title, seniority, and company. Users receive 100 free credits daily.
