Reducing row count estimation errors in PostgreSQL

Published: February 6, 2026 at 12:41 AM EST
5 min read
Source: Dev.to

Introduction

PostgreSQL’s query planner relies on table statistics to estimate the number of rows (estimated rows) each operation will process, and then selects an optimal execution plan based on these estimates. When the estimated rows diverge significantly from the actual rows, the planner can choose a sub‑optimal plan, leading to severe query‑performance degradation.

This article walks through four approaches I used to reduce row‑count estimation errors, ordered from least to most invasive.
Due to confidentiality constraints, I cannot share actual SQL or execution plans; the focus is on the diagnostic thought process and the techniques applied.

The approaches are applicable to any modern PostgreSQL version, as the underlying mechanisms (autovacuum, pg_statistic, extended statistics) have been stable across releases.
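
As a generic illustration (the query and numbers below are invented, not from the case described here), the mismatch surfaces directly in EXPLAIN ANALYZE output – compare the planner’s rows estimate with the actual rows count:

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'shipped';

Seq Scan on orders  (cost=0.00..1834.00 rows=120 width=52) (actual time=0.019..24.312 rows=48210 loops=1)
  Filter: (status = 'shipped'::text)

Here the planner expected 120 rows but the scan returned 48,210 – the kind of gap that pushes the planner toward the wrong join strategy or scan method.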

1. Make Statistics Fresh

The target table had a high update frequency. The first hypothesis was that the statistics were simply stale.

  • In PostgreSQL, the autovacuum daemon automatically runs ANALYZE to update statistics stored in pg_statistic.
  • For tables with heavy write activity, auto‑ANALYZE may not keep up, causing the statistics to drift from reality.

Adjust autovacuum‑ANALYZE frequency per table

Instead of changing the server‑wide settings in postgresql.conf, I tuned the two key parameters for the specific table:

  • autovacuum_analyze_threshold – minimum number of tuple modifications before auto‑ANALYZE is triggered (default: 50)
  • autovacuum_analyze_scale_factor – fraction of the table size added to the threshold (default: 0.1, i.e. 10 %)

ALTER TABLE table_name SET (
    autovacuum_analyze_threshold = 0,
    autovacuum_analyze_scale_factor = 0.01
);

Auto‑ANALYZE fires once the number of modified tuples exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × reltuples, so setting the threshold to 0 and reducing the scale factor to 0.01 triggers it after just 1 % of the table has been modified.

How to verify:

SELECT relname,
       last_autoanalyze,
       n_mod_since_analyze
FROM   pg_stat_user_tables
WHERE  relname = 'table_name';
  • last_autoanalyze – timestamp of the most recent auto‑ANALYZE.
  • n_mod_since_analyze – number of row modifications since the last ANALYZE.
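
To see how close a table is to its next auto‑ANALYZE, you can compute the trigger point from the formula above (a sketch assuming the global defaults of 50 and 0.1; per‑table reloptions, where set, take precedence):

SELECT s.relname,
       s.n_mod_since_analyze,
       -- auto-ANALYZE fires once n_mod_since_analyze exceeds this value
       50 + 0.1 * c.reltuples AS approx_analyze_trigger
FROM   pg_stat_user_tables s
JOIN   pg_class c ON c.oid = s.relid
WHERE  s.relname = 'table_name';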

For the full list of per‑table storage parameters, see the PostgreSQL documentation on storage parameters.

2. Increase Sample Size

After ensuring statistics are fresh, the next hypothesis was that ANALYZE was running often enough but the sample size was too small to produce accurate statistics.

PostgreSQL’s ANALYZE reads a random sample of the table’s rows and stores, for each column, statistics including:

  • Most‑common values (MCVs)
  • Histograms

The precision of this information is governed by default_statistics_target (default = 100), which controls the number of histogram buckets and MCV entries; it also determines how many rows ANALYZE samples (roughly 300 × the target).

Raise the statistics target per column

ALTER TABLE table_name
    ALTER COLUMN column_name SET STATISTICS 500;
  • Guideline: set the target to 500–1000 for columns frequently used in WHERE clauses.
  • Trade‑off: higher values increase ANALYZE execution time and enlarge pg_statistic.

Note: After changing the statistics target with SET STATISTICS, you must run ANALYZE (or wait for the next auto‑ANALYZE) for the new setting to take effect.
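
To confirm that the larger target actually produced richer statistics, you can inspect pg_stats after the next ANALYZE (a sketch; the cast through text works around the anyarray pseudo‑type of these view columns):

SELECT attname,
       n_distinct,
       array_length(most_common_vals::text::text[], 1)  AS mcv_entries,
       array_length(histogram_bounds::text::text[], 1)  AS histogram_buckets
FROM   pg_stats
WHERE  tablename = 'table_name'
  AND  attname   = 'column_name';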

3. Capture Column Correlations with Extended Statistics

Even with fresh and precise base statistics, row‑count estimation errors can persist because the planner’s estimation model has structural limitations.

By default, PostgreSQL assumes that conditions on different columns are independent. When this assumption is violated, the planner multiplies selectivities independently, often resulting in a dramatic underestimation of the actual row count.

When does this happen?

  • A table has two columns a1 and a2.
  • There is a functional dependency between them (e.g., a1 determines a2).
  • The WHERE clause includes conditions on both columns.

A concrete example: country and city. Knowing the country largely determines the set of possible cities. The planner treats the selectivities of each condition as independent, producing an estimate far lower than reality.
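
To make the arithmetic concrete (with invented numbers): if 1 % of rows match country = 'Japan' and 1 % match city = 'Tokyo', the planner multiplies the two selectivities and predicts 0.01 × 0.01 = 0.0001 of the table. In reality every 'Tokyo' row is in Japan, so the true selectivity is the full 1 % – an estimate off by a factor of 100.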

Create extended statistics

CREATE STATISTICS stat_name ON a1, a2 FROM table_name;

CREATE STATISTICS supports three kinds of statistics:

  • ndistinct – multi‑column distinct counts
  • dependencies – functional dependencies
  • mcv – multi‑column most‑common values

If you omit the list of statistics kinds, all three are collected, which is what I did as a starting point.

Note: CREATE STATISTICS only defines the statistics object. The actual statistics are not populated until ANALYZE runs on the table.
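
Putting the pieces together for the country/city example above (illustrative table and column names; the pg_stats_ext view is available from PostgreSQL 12):

CREATE STATISTICS stat_country_city ON country, city FROM addresses;

ANALYZE addresses;

-- Inspect the gathered multi-column statistics
SELECT statistics_name, dependencies, most_common_vals
FROM   pg_stats_ext
WHERE  statistics_name = 'stat_country_city';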

For more details, see the PostgreSQL documentation on CREATE STATISTICS.

4. Use Planner Hints (pg_hint_plan) – A Last‑Resort Option

When statistics‑based approaches are insufficient, the pg_hint_plan extension provides a way to directly control the planner’s behavior through SQL comment‑based hints.

Override row‑count estimates

/*+ Rows(t1 t2 #1000) */ SELECT ...
  • The Rows hint corrects the estimated row count of a join result, so it names the joined tables (here t1 and t2).
  • # – set the estimate to an absolute value (here, 1000).
  • You can also use +, -, or * to add, subtract, or multiply the planner’s original estimate.
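
A fuller sketch with hypothetical table names (assumes pg_hint_plan is installed; LOAD activates it for the current session if it is not preloaded):

LOAD 'pg_hint_plan';

-- Tell the planner the orders-customers join yields about 1000 rows
/*+ Rows(o c #1000) */
SELECT *
FROM   orders o
JOIN   customers c ON c.id = o.customer_id
WHERE  o.status = 'shipped';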

Drawbacks of hint‑based approaches

  • Fragile to data changes – fixed row counts become inaccurate as data volumes evolve.
  • Reduced maintainability – team members unfamiliar with hints may be confused.
  • Masks root causes – hints can hide underlying statistics or schema issues that should be addressed.

Recommendation: Use hints only after statistics‑based methods have been exhausted, or as a temporary measure while a more permanent fix is being implemented.

Investigating the Root Cause

This article covered four approaches to reducing row‑count estimation errors in PostgreSQL, ordered by increasing invasiveness:

  1. Tune autovacuum frequency – Are the statistics stale?
  2. Increase the statistics target – Is the sample size sufficient?
  3. Create extended statistics – Can the planner account for cross‑column correlations?
  4. Apply hint clauses – A last resort when statistics alone cannot solve the problem.

A Systematic Troubleshooting Workflow

When facing row‑estimation errors, a systematic approach works best:

  1. Run EXPLAIN ANALYZE – Compare the planner’s estimated row counts with the actual row counts returned by the query.
  2. Check statistics freshness – If they are outdated, adjust autovacuum settings or run ANALYZE manually.
  3. Improve precision – Raise the default_statistics_target (or per‑column targets) to give the planner a larger sample.
  4. Address structural limitations – Use CREATE STATISTICS to capture multi‑column correlations that the planner otherwise ignores.
  5. Apply hint clauses – As a last resort, employ extensions such as pg_hint_plan to force specific join orders or scan methods.

I hope this article serves as a useful reference in your troubleshooting process.
