EP 6.1: How to Search 1 Billion Rows in Milliseconds
Source: Dev.to
The Problem: The Full Table Scan
Imagine a library with 1 million books. If you want to find The Great Gatsby and the books aren’t organized, you have to look at every single book one by one until you find it. In database terms, this is a full table scan (linear time complexity).
As your database grows from thousands to millions of rows, your SELECT queries will slow down to a crawl, eventually timing out and crashing your application.
Indexes
An index is a separate data structure that stores a sorted version of a column’s data alongside a pointer (reference) to the actual row in the main table. Most relational databases (like PostgreSQL and MySQL) use a B‑Tree (balanced tree) structure.
How a B‑Tree Index Works
- Root – The search starts at the top node.
- Decision – The system compares your value (e.g.,
User ID = 505) to the values in the node and decides whether to go left or right. - Leaf – You reach the bottom of the tree, which contains the exact location of the data on disk.
This turns a search from O(n) into O(log n). Searching 1 million rows suddenly only takes about 20 “jumps.”
Analogy
Think of the index like the index at the back of a textbook:
- You look for the keyword “Load Balancing.”
- The index tells you it’s on page 42.
- You jump directly to page 42 without reading the previous 41 pages.
Types of Indexes
Clustered Index (Primary Key)
- The table itself is physically sorted in this order.
- You can have only one clustered index per table.
Non‑Clustered Indexes
- Additional indexes you create on columns like
emailorcreated_at. - They live in a separate “folder” from your main data and point back to the clustered index.
Composite Indexes
If your query always looks like:
WHERE country = 'India' AND city = 'Mumbai'
a single index covering both columns is much faster than the database trying to merge two separate indexes.
Pro tip: Order matters in composite indexes. An index on (city, country) is not the same as (country, city).
Trade‑offs
- Read vs. Write: Indexes speed up reads (
SELECT) but slow down writes (INSERT/UPDATE/DELETE). - Write Penalty: Every time you add a new row, the database must find the correct spot in every B‑Tree associated with that table and rebalance the tree.
- Storage Cost: Indexes take up disk space; on massive tables, an index can sometimes be larger than the data itself.
Practical Guidelines
- Index Foreign Keys: This makes
JOINoperations significantly faster. - Covering Queries: If your index contains all the data the
SELECTneeds (e.g., you indexemailand only selectemail), the database doesn’t even have to look at the main table. This is called an index‑only scan. - Avoid Low‑Cardinality Columns: Indexing a column like
gender(values onlyM/F/O) is often useless because the database still has to scan a large portion of the table. - Monitor with
EXPLAIN ANALYZE: Before adding an index, run your query withEXPLAINto see if the database is performing a sequential scan or using an existing index.
Best Practices
Indexes are the most powerful tool for database performance, but use them like salt: too much will ruin the dish.
- Start by indexing columns used in
WHEREclauses,JOINconditions, andORDER BYstatements. - Monitor your write latency and adjust as needed.