Reducing row count estimation errors in PostgreSQL
Source: Dev.to
Introduction
PostgreSQL’s query planner relies on table statistics to estimate the number of rows (estimated rows) each operation will process, and then selects an optimal execution plan based on these estimates. When the estimated rows diverge significantly from the actual rows, the planner can choose a sub‑optimal plan, leading to severe query‑performance degradation.
This article walks through four approaches I used to reduce row‑count estimation errors, ordered from least to most invasive.
Due to confidentiality constraints, I cannot share actual SQL or execution plans; the focus is on the diagnostic thought process and the techniques applied.
The approaches are applicable to any modern PostgreSQL version, as the underlying mechanisms (autovacuum, pg_statistic, extended statistics) have been stable across releases.
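Before applying any fix, it helps to see the estimation error itself. A minimal sketch of the diagnostic step, assuming a hypothetical `orders` table (the article's real queries are confidential):

```sql
-- Compare the planner's estimate (rows=...) with reality (actual ... rows=...)
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE status = 'shipped'
  AND region = 'EU';

-- An illustrative plan line might look like:
--   Seq Scan on orders  (cost=0.00..431.00 rows=120 ...) (actual time=... rows=48210 loops=1)
-- A large gap between the estimated and actual row counts signals an estimation problem.
```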
1. Make Statistics Fresh
The target table had a high update frequency. The first hypothesis was that the statistics were simply stale.
- In PostgreSQL, the autovacuum daemon automatically runs ANALYZE to update the statistics stored in `pg_statistic`.
- For tables with heavy write activity, auto‑ANALYZE may not keep up, causing the statistics to drift from reality.
Adjust autovacuum‑ANALYZE frequency per table
Instead of changing the server‑wide settings in postgresql.conf, I tuned the two key parameters for the specific table:
| Parameter | Meaning | Default |
|---|---|---|
| `autovacuum_analyze_threshold` | Minimum number of tuple modifications before auto‑ANALYZE is triggered | 50 |
| `autovacuum_analyze_scale_factor` | Fraction of the table size added to the threshold | 0.1 (10 %) |
```sql
ALTER TABLE table_name SET (
  autovacuum_analyze_threshold = 0,
  autovacuum_analyze_scale_factor = 0.01
);
```
Auto‑ANALYZE fires once the number of tuples modified since the last ANALYZE exceeds `autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × reltuples`. Setting the threshold to 0 and reducing the scale factor to 0.01 therefore forces auto‑ANALYZE after just 1 % of the table has been modified.
How to verify:
```sql
SELECT relname,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'table_name';
```
- `last_autoanalyze` – timestamp of the most recent auto‑ANALYZE.
- `n_mod_since_analyze` – number of row modifications since the last ANALYZE.
For the full list of per‑table storage parameters, see the PostgreSQL documentation on storage parameters.
2. Increase Sample Size
After ensuring the statistics were fresh, the next hypothesis was that ANALYZE ran often enough but sampled too few rows to produce accurate statistics.
PostgreSQL’s ANALYZE collects samples from each column and stores:
- Most‑common values (MCVs)
- Histograms
The precision of this information is governed by default_statistics_target (default = 100), which controls the number of histogram buckets and MCV entries.
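The relationship between the target and the sample is roughly linear: ANALYZE samples on the order of 300 × the statistics target rows, so the default of 100 corresponds to a sample of about 30,000 rows. A quick way to check the current server‑wide setting:

```sql
-- Show the server-wide default (100 unless changed in postgresql.conf)
SHOW default_statistics_target;

-- ANALYZE samples roughly 300 × the statistics target rows per table,
-- so the default target of 100 means a sample of about 30,000 rows.
```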
Raise the statistics target per column
```sql
ALTER TABLE table_name
  ALTER COLUMN column_name SET STATISTICS 500;
```
- Guideline: set the target to 500–1000 for columns frequently used in `WHERE` clauses.
- Trade‑off: higher values increase ANALYZE execution time and enlarge `pg_statistic`.
Note: after changing the statistics target with `SET STATISTICS`, you must run `ANALYZE` (or wait for the next auto‑ANALYZE) for the new setting to take effect.
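In practice, the change can be applied and verified in one session. A sketch, reusing the placeholder table and column names from above and the `pg_stats` view:

```sql
-- Repopulate statistics under the new, larger target
ANALYZE table_name;

-- Inspect the resulting per-column statistics; with a higher target,
-- the MCV list and histogram can hold more entries
SELECT attname,
       n_distinct,
       array_length(most_common_vals::text::text[], 1) AS n_mcv
FROM pg_stats
WHERE tablename = 'table_name'
  AND attname = 'column_name';
```

The double cast on `most_common_vals` is a workaround for its `anyarray` type, which cannot be passed to `array_length` directly.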
3. Capture Column Correlations with Extended Statistics
Even with fresh and precise base statistics, row‑count estimation errors can persist because the planner’s estimation model has structural limitations.
By default, PostgreSQL assumes that conditions on different columns are independent. When this assumption is violated, the planner multiplies selectivities independently, often resulting in a dramatic underestimation of the actual row count.
When does this happen?
- A table has two columns `a1` and `a2`.
- There is a functional dependency between them (e.g., `a1` determines `a2`).
- The `WHERE` clause includes conditions on both columns.
A concrete example: country and city. Knowing the country largely determines the set of possible cities, yet the planner treats the selectivity of each condition as independent and multiplies them, producing an estimate far lower than reality.
Create extended statistics
```sql
CREATE STATISTICS stat_name ON a1, a2 FROM table_name;
```
CREATE STATISTICS supports three kinds of statistics:
- `ndistinct` – multi‑column distinct counts
- `dependencies` – functional dependencies
- `mcv` – multi‑column most‑common values
If you omit the KIND clause, all three are collected, which is what I did as a starting point.
Note: `CREATE STATISTICS` only defines the statistics object; the actual statistics are not populated until ANALYZE runs on the table.
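For the country/city case above, the functional‑dependency kind can also be requested on its own. A sketch, assuming a hypothetical `addresses` table (the `pg_stats_ext` view used for inspection exists in PostgreSQL 12 and later):

```sql
-- Collect only functional-dependency statistics for the correlated pair
CREATE STATISTICS addr_country_city (dependencies)
  ON country, city FROM addresses;

-- Populate the new statistics object
ANALYZE addresses;

-- Inspect the detected dependency strengths (values between 0 and 1)
SELECT stxname, dependencies
FROM pg_stats_ext
WHERE stxname = 'addr_country_city';
```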
For more details, see the PostgreSQL documentation on CREATE STATISTICS.
4. Use Planner Hints (pg_hint_plan) – A Last‑Resort Option
When statistics‑based approaches are insufficient, the pg_hint_plan extension provides a way to directly control the planner’s behavior through SQL comment‑based hints.
Override row‑count estimates
```sql
/*+ Rows(table_name #1000) */ SELECT ...
```
- `#` – set the estimate to an absolute value (here, 1000).
- `+`, `-`, or `*` – add to, subtract from, or multiply the planner’s original estimate.
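Per the pg_hint_plan documentation, the `Rows` hint corrects the estimated result size of a join between the listed tables. An illustrative sketch, assuming two hypothetical tables `t1` and `t2`:

```sql
-- Fix the estimated size of the t1-t2 join result to 1000 rows
/*+ Rows(t1 t2 #1000) */
SELECT * FROM t1 JOIN t2 ON t1.id = t2.t1_id;

-- Or scale the planner's own estimate for the join up by a factor of 10
/*+ Rows(t1 t2 *10) */
SELECT * FROM t1 JOIN t2 ON t1.id = t2.t1_id;
```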
Drawbacks of hint‑based approaches
| Issue | Explanation |
|---|---|
| Fragile to data changes | Fixed row counts become inaccurate as data volumes evolve. |
| Reduced maintainability | Team members unfamiliar with hints may be confused. |
| Masks root causes | Hints can hide underlying statistics or schema issues that should be addressed. |
Recommendation: Use hints only after statistics‑based methods have been exhausted, or as a temporary measure while a more permanent fix is being implemented.
Summary
This article covered four approaches to reducing row‑count estimation errors in PostgreSQL, ordered by increasing invasiveness:
- Tune autovacuum frequency – Are the statistics stale?
- Increase the statistics target – Is the sample size sufficient?
- Create extended statistics – Can the planner account for cross‑column correlations?
- Apply hint clauses – A last resort when statistics alone cannot solve the problem.
A Systematic Troubleshooting Workflow
When facing row‑estimation errors, a systematic approach works best:
- Run `EXPLAIN ANALYZE` – compare the planner’s estimated row counts with the actual row counts returned by the query.
- Check statistics freshness – if they are outdated, adjust autovacuum settings or run `ANALYZE` manually.
- Improve precision – raise `default_statistics_target` (or per‑column targets) to give the planner a larger sample.
- Address structural limitations – use `CREATE STATISTICS` to capture multi‑column correlations that the planner otherwise ignores.
- Apply hint clauses – as a last resort, employ extensions such as `pg_hint_plan` to force specific row counts, join orders, or scan methods.
I hope this article serves as a useful reference in your troubleshooting process.
References
- PostgreSQL documentation: Table storage parameters
- PostgreSQL documentation: CREATE STATISTICS
- pg_hint_plan (GitHub): https://github.com/ossc-db/pg_hint_plan