How We Query 16.8M SIRENE Establishments in 66ms
Source: Dev.to
The Challenge
Our establishment table has 16.8 million rows. Users need to search by:
| Field | Type | Notes |
|---|---|---|
| SIREN | 9 digits | Exact match – trivial with a B‑tree index |
| SIRET | 14 digits | Exact match – same |
| Company name | Text | Fuzzy match – the interesting part |
The name search must handle:
- Partial matches – “Total” should find “TotalEnergies SE”
- Typos – “Miclein” should find “Michelin”
- Accent insensitivity – “Societe Generale” should match “Société Générale”
The Naïve Approach: ILIKE
SELECT *
FROM georefer.establishment
WHERE company_name ILIKE '%total%'
LIMIT 25;
EXPLAIN ANALYZE
Seq Scan on establishment
Filter: (company_name ~~* '%total%')
Rows Removed by Filter: 16799975
Planning Time: 0.1ms
Execution Time: 12,847ms
12.8 seconds → a full sequential scan on 16.8 M rows. Unusable.
Enter pg_trgm
PostgreSQL’s pg_trgm extension breaks strings into trigrams (3‑character sequences) and uses GIN indexes to find similar strings efficiently.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment
USING GIN (company_name gin_trgm_ops);
Now with trigram similarity:
SELECT *,
similarity(company_name, 'total') AS sim
FROM georefer.establishment
WHERE company_name % 'total'
ORDER BY sim DESC
LIMIT 25;
EXPLAIN ANALYZE
Bitmap Heap Scan on establishment
Recheck Cond: (company_name % 'total')
-> Bitmap Index Scan on idx_establishment_name_trgm
Index Cond: (company_name % 'total')
Planning Time: 0.3ms
Execution Time: 66ms
66 ms → 194× improvement over the naïve approach.
The Import Strategy: 3 Phases
Importing 16.8 M rows isn’t straightforward. We use a three‑phase approach.
Phase 1 – Schema + Staging
CREATE TABLE georefer.establishment (
id SERIAL PRIMARY KEY,
siren VARCHAR(9) NOT NULL,
siret VARCHAR(14) NOT NULL UNIQUE,
company_name VARCHAR(255),
commercial_name VARCHAR(255),
legal_form VARCHAR(10),
naf_code VARCHAR(6),
employee_range VARCHAR(5),
postal_code VARCHAR(5),
city VARCHAR(100),
department_code VARCHAR(3),
is_headquarters BOOLEAN DEFAULT FALSE,
is_active BOOLEAN DEFAULT TRUE,
created_date DATE,
last_update DATE
);
Phase 2 – Bulk COPY
COPY georefer.establishment
(siren, siret, company_name, ...)
FROM '/tmp/sirene_active.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
COPY is 10–50× faster than batch INSERT. The 16.8 M rows load in about 8 minutes.
Phase 3 – Index Creation
Create indexes after the bulk import (building them beforehand would slow the load).
-- Exact lookups
CREATE INDEX idx_establishment_siren ON georefer.establishment(siren);
CREATE INDEX idx_establishment_siret ON georefer.establishment(siret);
-- Geographic filtering
CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code);
CREATE INDEX idx_establishment_dept ON georefer.establishment(department_code);
CREATE INDEX idx_establishment_city ON georefer.establishment(city);
-- Fuzzy name search
CREATE INDEX idx_establishment_naf ON georefer.establishment(naf_code);
CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment USING GIN (company_name gin_trgm_ops);
Combined Query: Name + Geographic Filter
SELECT *,
similarity(company_name, 'boulangerie') AS sim
FROM georefer.establishment
WHERE company_name % 'boulangerie'
AND department_code = '75'
AND is_active = true
ORDER BY sim DESC
LIMIT 25;
Result: all bakeries in Paris in ~45 ms, even across 16.8 M rows.
The API Layer
The Spring Boot service exposes the search via REST.
# Search by SIREN
curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \
-H 'X-Georefer-API-Key: YOUR_API_KEY'
# Search by name + department
curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \
-H 'X-Georefer-API-Key: YOUR_API_KEY'
Sample JSON response
{
"success": true,
"data": [
{
"siren": "855200507",
"siret": "85520050700046",
"company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN",
"naf_code": "22.11Z",
"employee_range": "5000+",
"postal_code": "63000",
"city": "CLERMONT-FERRAND",
"is_headquarters": true
}
]
}
Performance Summary
| Query Type | Before (no index) | After (pg_trgm) |
|---|---|---|
Simple ILIKE on name | 12,847 ms | 66 ms |
| Name + department filter | ~12 s (full scan) | ~45 ms |
| Exact SIREN / SIRET lookup | µs (B‑tree) | µs (B‑tree) |
Numbers are illustrative; actual timings may vary with hardware and load.
Improvement
| Query | Total Time | Avg. Latency | Speed‑up |
|---|---|---|---|
| Name search | 12,847 ms | 66 ms | 194× |
| Name + dept filter | 13,102 ms | 45 ms | 291× |
| SIREN exact | 8,200 ms | 0.3 ms | 27,333× |
| SIRET exact | 8,150 ms | 0.2 ms | 40,750× |
Lessons Learned
- Always create indexes after bulk import – creating them before can make the import 10× slower.
pg_trgmGIN indexes use a lot of disk – our 16.8 M‑row trigram index is ~2.3 GB.- Set
maintenance_work_memhigh during index creation –SET maintenance_work_mem = '1GB'cuts index creation time in half. COPYbeatsINSERTevery time for bulk loading – useCOPYfor anything over 10 K rows.
Try It
GEOREFER exposes 16.8 M SIRENE establishments through a simple REST API:
- Free tier: 100 req/day, no credit card required.
- Docs:
- Sign up:
AZMORIS Engineering – “Software that Endures”