How We Query 16.8M SIRENE Establishments in 66ms

Published: 1 month ago (March 7, 2026 at 07:25 AM EST)

4 min read

Source: Dev.to

Source: Dev.to

The Challenge

Our establishment table has 16.8 million rows. Users need to search by:

Field	Type	Notes
SIREN	9 digits	Exact match – trivial with a B‑tree index
SIRET	14 digits	Exact match – same
Company name	Text	Fuzzy match – the interesting part

The name search must handle:

Partial matches – “Total” should find “TotalEnergies SE”
Typos – “Miclein” should find “Michelin”
Accent insensitivity – “Societe Generale” should match “Société Générale”

The Naïve Approach: `ILIKE`

SELECT *
FROM georefer.establishment
WHERE company_name ILIKE '%total%'
LIMIT 25;

EXPLAIN ANALYZE

Seq Scan on establishment
  Filter: (company_name ~~* '%total%')
  Rows Removed by Filter: 16799975
  Planning Time: 0.1ms
  Execution Time: 12,847ms

12.8 seconds → a full sequential scan on 16.8 M rows. Unusable.

Enter `pg_trgm`

PostgreSQL’s pg_trgm extension breaks strings into trigrams (3‑character sequences) and uses GIN indexes to find similar strings efficiently.

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment
USING GIN (company_name gin_trgm_ops);

Now with trigram similarity:

SELECT *,
       similarity(company_name, 'total') AS sim
FROM georefer.establishment
WHERE company_name % 'total'
ORDER BY sim DESC
LIMIT 25;

EXPLAIN ANALYZE

Bitmap Heap Scan on establishment
  Recheck Cond: (company_name % 'total')
  -> Bitmap Index Scan on idx_establishment_name_trgm
       Index Cond: (company_name % 'total')
  Planning Time: 0.3ms
  Execution Time: 66ms

66 ms → 194× improvement over the naïve approach.

The Import Strategy: 3 Phases

Importing 16.8 M rows isn’t straightforward. We use a three‑phase approach.

Phase 1 – Schema + Staging

CREATE TABLE georefer.establishment (
    id               SERIAL PRIMARY KEY,
    siren            VARCHAR(9)  NOT NULL,
    siret            VARCHAR(14) NOT NULL UNIQUE,
    company_name     VARCHAR(255),
    commercial_name  VARCHAR(255),
    legal_form       VARCHAR(10),
    naf_code         VARCHAR(6),
    employee_range   VARCHAR(5),
    postal_code      VARCHAR(5),
    city             VARCHAR(100),
    department_code  VARCHAR(3),
    is_headquarters  BOOLEAN DEFAULT FALSE,
    is_active        BOOLEAN DEFAULT TRUE,
    created_date     DATE,
    last_update      DATE
);

Phase 2 – Bulk `COPY`

COPY georefer.establishment
    (siren, siret, company_name, ...)
FROM '/tmp/sirene_active.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');

COPY is 10–50× faster than batch INSERT. The 16.8 M rows load in about 8 minutes.

Phase 3 – Index Creation

Create indexes after the bulk import (building them beforehand would slow the load).

-- Exact lookups
CREATE INDEX idx_establishment_siren ON georefer.establishment(siren);
CREATE INDEX idx_establishment_siret ON georefer.establishment(siret);

-- Geographic filtering
CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code);
CREATE INDEX idx_establishment_dept   ON georefer.establishment(department_code);
CREATE INDEX idx_establishment_city   ON georefer.establishment(city);

-- Fuzzy name search
CREATE INDEX idx_establishment_naf   ON georefer.establishment(naf_code);
CREATE INDEX idx_establishment_name_trgm
    ON georefer.establishment USING GIN (company_name gin_trgm_ops);

Combined Query: Name + Geographic Filter

SELECT *,
       similarity(company_name, 'boulangerie') AS sim
FROM georefer.establishment
WHERE company_name % 'boulangerie'
  AND department_code = '75'
  AND is_active = true
ORDER BY sim DESC
LIMIT 25;

Result: all bakeries in Paris in ~45 ms, even across 16.8 M rows.

The API Layer

The Spring Boot service exposes the search via REST.

# Search by SIREN
curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY'

# Search by name + department
curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY'

Sample JSON response

{
  "success": true,
  "data": [
    {
      "siren": "855200507",
      "siret": "85520050700046",
      "company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN",
      "naf_code": "22.11Z",
      "employee_range": "5000+",
      "postal_code": "63000",
      "city": "CLERMONT-FERRAND",
      "is_headquarters": true
    }
  ]
}

Performance Summary

Query Type	Before (no index)	After (`pg_trgm`)
Simple `ILIKE` on name	12,847 ms	66 ms
Name + department filter	~12 s (full scan)	~45 ms
Exact SIREN / SIRET lookup	µs (B‑tree)	µs (B‑tree)

Numbers are illustrative; actual timings may vary with hardware and load.

Improvement

Query	Total Time	Avg. Latency	Speed‑up
Name search	12,847 ms	66 ms	194×
Name + dept filter	13,102 ms	45 ms	291×
SIREN exact	8,200 ms	0.3 ms	27,333×
SIRET exact	8,150 ms	0.2 ms	40,750×

Lessons Learned

Always create indexes after bulk import – creating them before can make the import 10× slower.
pg_trgm GIN indexes use a lot of disk – our 16.8 M‑row trigram index is ~2.3 GB.
Set maintenance_work_mem high during index creation – SET maintenance_work_mem = '1GB' cuts index creation time in half.
COPY beats INSERT every time for bulk loading – use COPY for anything over 10 K rows.

Try It

GEOREFER exposes 16.8 M SIRENE establishments through a simple REST API:

Free tier: 100 req/day, no credit card required.
Docs:
Sign up:

AZMORIS Engineering – “Software that Endures”

How We Query 16.8M SIRENE Establishments in 66ms

The Challenge

The Naïve Approach: `ILIKE`

Enter `pg_trgm`

The Import Strategy: 3 Phases

Phase 1 – Schema + Staging

Phase 2 – Bulk `COPY`

Phase 3 – Index Creation

Combined Query: Name + Geographic Filter

The API Layer

Performance Summary

Improvement

Lessons Learned

Try It

Related posts

How I Built a 37 Million Row Search Engine That Returns Results in 200ms

Elasticsearch vs RDBMS, Logstash vs Fluentd, Elasticsearch vs Opensearch

TimescaleDB Continuous Aggregates: Real-Time vs Materialized-Only

🚀 Building My First Backend Project: Expense Tracker API

The Challenge

The Naïve Approach: ILIKE

Enter pg_trgm

The Import Strategy: 3 Phases

Phase 1 – Schema + Staging

Phase 2 – Bulk COPY

Phase 3 – Index Creation

Combined Query: Name + Geographic Filter

The API Layer

Performance Summary

Improvement

Lessons Learned

Try It

Related posts

How I Built a 37 Million Row Search Engine That Returns Results in 200ms

Elasticsearch vs RDBMS, Logstash vs Fluentd, Elasticsearch vs Opensearch

TimescaleDB Continuous Aggregates: Real-Time vs Materialized-Only

🚀 Building My First Backend Project: Expense Tracker API

The Naïve Approach: `ILIKE`

Enter `pg_trgm`

Phase 1 – Schema + Staging

Phase 2 – Bulk `COPY`

Phase 3 – Index Creation