The Deep Mechanics of Online Bulk Deletion in PostgreSQL

Published: (June 14, 2026 at 02:35 PM EDT)
17 min read
Source: Dev.to

Source: Dev.to

MVCC, WAL, vacuum, and replication slots under sustained delete load - and how to delete billions of rows without your database noticing

Most “how to delete a lot of rows” articles stop at “batch it and delete children before parents.” That advice is correct, it’s table stakes, and everyone already knows it. This article is about everything after that - the parts that actually decide whether your cleanup runs quietly in the background for a week or pages you at 3 a.m. with a full disk and a replica that’s six hours behind.

The thesis: at scale, your DELETE statement is the easy part. The adversaries are the subsystems a delete feeds - MVCC tuple versioning, the write-ahead log, autovacuum, and the replication machinery. Bulk deletion is really an exercise in flow control across those subsystems. Get the SQL right and the systems wrong, and you’ll still take production down.

We’ll assume PostgreSQL (the internals are PG-specific), a live OLTP primary with at least one physical replica and one or more logical/CDC consumers, and a target of hundreds of millions to billions of rows across many related tables.

The one paragraph of “basics,” so we can move on: delete in dependency order (referencing rows before referenced rows); collect parent keys once; never rely on ON DELETE CASCADE for huge deletes because you can’t throttle a cascade. Done. Now the real material.

  1. What a DELETE actually costs

A delete is not “remove a row.” Under MVCC it’s “mark a row version dead and write that fact everywhere.” For each deleted tuple, PostgreSQL:

Sets xmax on the heap tuple to your transaction id. The row is still physically present; it becomes a dead tuple once your transaction commits and no snapshot can still see it.

Writes a WAL record for the heap change. If this is the first modification of that page since the last checkpoint, it also writes a full-page image (FPI) - potentially 8 KB of WAL for a single-row change.

Touches every index. Index entries aren’t removed at delete time; they’re left dangling and cleaned up later by vacuum. So a table with 12 indexes generates index-cleanup debt across all 12.

Feeds logical decoding. If a logical replication slot exists, the change is decoded from WAL and streamed to consumers - and how much gets written depends on the table’s REPLICA IDENTITY (Section 5). Multiply by a billion. The deleted rows don’t shrink the table - they inflate it with dead tuples until vacuum reclaims them, and they inflate WAL, which every replica and slot must consume. The delete is cheap; the wake it leaves is expensive. Everything below is about managing that wake.

2. Batching is transaction-lifetime control, not just “smaller deletes”

People batch to avoid one long lock. True, but the deeper reason is snapshot lifetime, and it’s the single most important concept in this entire article. A transaction holds a snapshot. That snapshot defines an xmin horizon - the oldest transaction id still considered “in progress” from the system’s point of view. Two brutal consequences follow, and they’re the root cause of most bulk-delete disasters:

Vacuum cannot reclaim any dead tuple newer than the oldest snapshot’s xmin. A long-running transaction anywhere in the database freezes the cleanup horizon. So if your delete runs as one multi-hour transaction, autovacuum literally cannot clean up the dead tuples you’re producing - they pile up until you commit.

Logical replication slots cannot advance past an open transaction. More on this in Section 6, but the headline is the same: a long-lived transaction pins WAL on disk. So a billion-row delete in one transaction is a triple threat: a giant lock, an un-vacuumable mountain of dead tuples, and pinned WAL. Batching with per-batch COMMIT is how you keep the xmin horizon moving so vacuum and replication can do their jobs while you work.

 - The shape that matters: tiny transactions, monotonic cursor, durable progress.
LOOP
 SELECT max(id) INTO v_hi
 FROM (SELECT id FROM t WHERE owner=:o AND id > v_lo ORDER BY id LIMIT :n) s;
 EXIT WHEN v_hi IS NULL;
 DELETE FROM t WHERE owner=:o AND id > v_lo AND id  v_lo ORDER BY id LIMIT n`) rides the primary-key index and is **O(1) per batch regardless of depth**, unlike `OFFSET` which re-scans everything it skips. The `COMMIT` is doing far more than releasing locks - it's resetting your contribution to the global xmin horizon every few thousand rows.

`COMMIT` inside a loop requires a **procedure** (or anonymous `DO` block). Functions run inside the caller's transaction and cannot commit. This distinction will matter again in Section 7.

-
## 3. Dead tuples and the vacuum you must run yourself
Autovacuum is designed for steady-state churn, not for a deliberate billion-row purge. During a heavy delete you out-produce it, and three things go wrong: the table bloats, indexes bloat worse, and - if the run is long enough - you flirt with transaction-id wraparound pressure. You need to take vacuum into your own hands.
### 3.1 Why "just let autovacuum handle it" fails
Autovacuum triggers when dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. With the default `scale_factor = 0.2`, a 2-billion-row table won't trigger autovacuum until **~400 million** dead tuples have accumulated. By then you have catastrophic bloat and a vacuum that takes hours and competes with your delete for I/O.
For a purge window, **tighten autovacuum on the target tables** for the duration, then reset:

ALTER TABLE big SET ( autovacuum_vacuum_scale_factor = 0.005, - fire at 0.5% dead, not 20% autovacuum_vacuum_threshold = 5000000, autovacuum_vacuum_cost_delay = 5, - throttled but not glacial autovacuum_vacuum_cost_limit = 2000 - ~10x default budget per cycle );  - … run the purge … ALTER TABLE big RESET (autovacuum_vacuum_scale_factor, autovacuum_vacuum_threshold, autovacuum_vacuum_cost_delay, autovacuum_vacuum_cost_limit);


    Enter fullscreen mode
    

    Exit fullscreen mode
    

  
  
  3.2 Vacuuming *during* the run, from inside a procedure

You also want explicit `VACUUM` checkpoints - say every N million rows deleted from a given table. Two PostgreSQL facts make this awkward:

**`VACUUM` cannot run inside a transaction block.** Your batched-delete procedure is always in a transaction, so you can't call it directly.
The clean workaround is to run it over a **separate connection**. `dblink` is the idiomatic trick - `dblink_exec` opens its own backend where `VACUUM` is a legitimate top-level statement:

PERFORM dblink_exec(:conn, format(‘VACUUM (ANALYZE) %I.%I’, :schema, :table));


    Enter fullscreen mode
    

    Exit fullscreen mode
    

And the scripting-layer trap that catches everyone: **`psql -c "stmt1; stmt2"` wraps both statements in one implicit transaction**, so a `VACUUM` after a `SET` fails with *"VACUUM cannot run inside a transaction block."* Use separate `-c` flags - each becomes its own transaction:

psql -c “SET statement_timeout=0” -c “VACUUM ANALYZE big” # works


    Enter fullscreen mode
    

    Exit fullscreen mode
    

  
  
  3.3 The horizon makes mid-run vacuum partly futile - unless you batch

Here's the payoff from Section 2: a `VACUUM` can only remove tuples older than the **global** oldest xmin. If your own delete is one long transaction, your mid-run vacuum reclaims almost nothing, because *your* snapshot is the horizon. Only because you batch-and-commit does mid-run vacuum actually free space. Vacuum and batching are not independent tactics - **batching is what makes vacuuming work at all.**

  
  
  3.4 Index bloat outlives table bloat

`VACUUM` makes dead space reusable but doesn't return it to the OS, and B-tree indexes bloat faster than heaps under massive deletes. After a large purge, estimate index bloat (the catalog/`pgstattuple` estimators) and `REINDEX INDEX CONCURRENTLY` the worst offenders - concurrently, so you don't take an `AccessExclusive` lock on a live table. Don't forget TOAST tables for wide columns; they bloat too.

  
  
  3.5 Wraparound, briefly

A days-long purge consumes transaction ids (one per batch commit, plus the deletes). It won't cause wraparound on its own, but combined with suppressed vacuuming it raises the table's age. Keep an eye on `age(relfrozenxid)` for the big tables if your purge runs for many days; the tightened autovacuum above also drives freezing.

-
## 4. WAL: the firehose you're pointing at your replicas
Every batch you delete becomes WAL, and WAL is the shared resource that couples your delete to *everything downstream*. Two amplifiers turn a reasonable delete into a WAL flood.
### 4.1 Full-page writes and checkpoints
The first write to a page after a checkpoint emits a full 8 KB page image. A delete that dirties many pages right after a checkpoint can emit far more WAL than the logical change suggests. You can't avoid FPIs, but you can avoid making them worse: don't run a purge concurrently with aggressive checkpoint settings, and spread the load (throttle) so checkpoints amortize. Watch `pg_stat_bgwriter` and the WAL generation rate.
### 4.2 `REPLICA IDENTITY FULL` is a WAL multiplier
This one silently doubles or triples your WAL and most people never check it. For logical decoding, PostgreSQL must write enough of the *old* row to identify it on the subscriber:

`REPLICA IDENTITY DEFAULT` → writes only the **primary key** of the deleted row. Cheap.

`REPLICA IDENTITY FULL` → writes the **entire old row**, every column, to WAL for each delete. On a wide table, that's a multi-hundred-byte WAL record per deleted row instead of a few bytes.
Tables without a primary key are often set to `FULL` (sometimes implicitly via tooling). If you're deleting a billion rows from such a table with a logical slot attached, you may write *an order of magnitude* more WAL than necessary. **Give big tables a real primary key and `REPLICA IDENTITY DEFAULT` before a large purge.** Measure WAL per batch (`pg_current_wal_lsn()` deltas) so amplification shows up immediately:

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :prev_lsn) AS wal_bytes_this_batch;


    Enter fullscreen mode
    

    Exit fullscreen mode
    

-
## 5. Replication slots: where bulk deletes go to die
This is the section that separates a clean purge from an outage. Your WAL has to be consumed by every replica and every replication slot before the primary can recycle it. If you produce WAL faster than they consume - or worse, if you *prevent* them from advancing - WAL accumulates on the primary's disk until it fills, and a full WAL volume is a hard production stop.
### 5.1 Two kinds of lag, and one of them is invisible

**Physical replicas** expose `replay_lag` (a time interval) in `pg_stat_replication`. Easy to see.

**Logical slots** (CDC, decoding-based replication) do **not** appear in `pg_stat_replication`. They live in `pg_replication_slots`, and their backlog is measured in **bytes of retained WAL**: `pg_current_wal_lsn() - confirmed_flush_lsn`. A stuck or slow logical consumer is the classic silent killer - it shows nothing in the obvious place while quietly pinning hundreds of gigabytes of WAL.

 - physical, seconds behind SELECT coalesce(max(extract(epoch FROM replay_lag))::int,0) FROM pg_stat_replication;  - logical, bytes of WAL retained (CHECK THIS - it’s the one that bites) SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)),0) FROM pg_replication_slots WHERE slot_type=‘logical’ AND active;


    Enter fullscreen mode
    

    Exit fullscreen mode
    

  
  
  5.2 `restart_lsn`, `confirmed_flush_lsn`, and the open-transaction trap

A logical slot tracks two LSNs: `confirmed_flush_lsn` (what the consumer has acknowledged) and `restart_lsn` (the oldest WAL the slot might still need if decoding restarts). The primary can only recycle WAL older than the **minimum `restart_lsn` across all slots**.

Now the trap. **`restart_lsn` cannot advance past the start of the oldest in-progress transaction**, because logical decoding emits whole transactions at commit and may need to re-read from a transaction's beginning. So an open transaction does two terrible things at once:

- It pins the **xmin horizon** → vacuum can't clean up (Section 3).

It pins **`restart_lsn`** → the primary retains all WAL written since that transaction began, by *every* session.
This is why a pause loop that **sleeps inside an open transaction is actively harmful**: you stop deleting to let consumers catch up, but your idle transaction's snapshot keeps `restart_lsn` frozen, so the WAL backlog *grows* during the pause instead of draining. You engineered a backpressure mechanism that makes backpressure worse.
The fix is one line, and it's the whole reason this needs to be a procedure:

WHILE lag_exceeds_threshold LOOP EXIT WHEN deadline_passed; - escape hatch, Section 7 COMMIT; - release snapshot BEFORE sleeping → restart_lsn can advance PERFORM pg_sleep(:recheck);  - re-measure lag END LOOP;


    Enter fullscreen mode
    

    Exit fullscreen mode
    

Commit, *then* sleep. Each pause iteration becomes a sub-10-second transaction, and `restart_lsn` is free to chase the consumers while you wait. There's also `catalog_xmin` on logical slots that pins catalog vacuuming similarly - same fix, same reasoning.

  
  
  5.3 Slot disk exhaustion is the actual failure mode

The incident isn't "replication is slow." It's "the primary's disk filled with retained WAL and the database stopped accepting writes." Set `max_slot_wal_keep_size` (PG 13+) as a safety valve so a dead slot gets invalidated rather than taking the primary down - and alert on per-slot retained bytes and on replication-instance disk headroom. A slow consumer should degrade your purge, never your primary.

-
## 6. Designing the backpressure controller
Put Sections 4–5 together into an explicit control loop that runs between batches. It's a feedback controller: measure lag, decide, act.

after each batch COMMIT: sleep(baseline_throttle) # steady-state pacing loop: # physical replica gate lag = max(replay_lag) if lag >= ALARM: warn(operator) # loud, no auto-abort if lag = ALARM: warn(operator) if backlog cursor` from a cold start returns only survivors:

SELECT id FROM child WHERE owner=:o AND id > :cursor ORDER BY id LIMIT :n;
Enter fullscreen mode


Exit fullscreen mode

EXISTS-guarded driver - if you must keep a separate key list, filter it to keys that still have rows:

SELECT k.id FROM key_list k
WHERE k.id > :cursor
 AND EXISTS (SELECT 1 FROM child c WHERE c.fk = k.id) - skip cleared keys
ORDER BY k.id LIMIT :n;
Enter fullscreen mode


Exit fullscreen mode

The planner difference is the whole game. The EXISTS becomes a semi-join: with an index on child.fk it’s an index-only probe per key (microseconds, no heap touch) or a merge semi-join between two sorted key streams - either way it skips millions of cleared keys at index speed instead of dragging each through a commit + throttle. Without an index on child.fk, that same EXISTS is a sequential scan of a billion-row table per probe - so the “fix” becomes a far worse disaster. Verify the index exists and check the EXPLAIN before trusting this at scale. Run EXPLAIN (ANALYZE, BUFFERS) on the real driver query against production-sized data; an Index Only Scan with single-digit heap fetches is what you want to see, not a Seq Scan.

7.3 Idempotency and audit

Record progress per (entity, table) in an audit table - rows found vs deleted, status, timestamp. Resume logic reads it to skip completed work; operators read it to answer “did it run, where is it, is it stuck.” Re-running a drained table must be a clean no-op. Treat interruption as the normal path.

8. Locking and online DDL around the purge

Bulk deletes themselves take only row locks, but the schema work around them (adding a PK to enable REPLICA IDENTITY DEFAULT, building an index for the EXISTS guard) needs care on a live table:

CREATE INDEX CONCURRENTLY - builds without an AccessExclusive lock (takes only ShareUpdateExclusive). Slower, online. Mandatory on hot tables.

SET NOT NULL without a full-table lock - the column-constraint route scans the table under AccessExclusive. On PG 12+ use the validated CHECK trick: add a CHECK (col IS NOT NULL) NOT VALID (instant, metadata-only), VALIDATE CONSTRAINT (online, ShareUpdateExclusive), then SET NOT NULL (now metadata-only because the validated check proves non-nullity), then drop the check.

lock_timeout + retry - wrap brief AccessExclusive operations (promoting a unique index to PK, switching replica identity) in a short lock_timeout with a retry loop, so they back off under contention instead of forming a lock queue that stalls the application.

idle_in_transaction_session_timeout - set it on the role running ad-hoc work near these tables, so an abandoned BEGIN in someone’s SQL console can’t pin your xmin horizon and restart_lsn for hours.

9. The escape hatch: don’t delete at all

The fastest delete is the one you never run. If the data you purge aligns with a time range or a tenant boundary, partition for deletion:

Range/list partition by the dimension you delete along (date, tenant, region). Then deletion is ALTER TABLE … DETACH PARTITION (instant, metadata-only) followed by DROP TABLE on the detached partition - O(1), no dead tuples, no WAL flood, no vacuum debt, no replica strain. A DROP TABLE writes a trivial amount of WAL versus a billion row-delete records. Partitioning has its own costs (planning overhead, constraint exclusion, cross-partition uniqueness), and it only helps if your deletion axis matches a partition key - you can’t partition by “arbitrary tenant scattered across an integer PK.” But when it fits, it turns a week-long throttled purge into a millisecond DROP. Always ask this question first; the rest of this article is what you do when the answer is “we can’t partition along that axis.”

10. Running it as a background job

The orchestration layer matters as much as the SQL:

Wall-clock deadline. Bound each run (MAX_RUN_HOURS); check it after every batch and at the top of every pause iteration; stop cleanly and let the next run resume. Bounds blast radius and prevents a run from colliding with the next scheduled one. Tip: make 0 an explicit “no deadline” sentinel for supervised runs, but guard every deadline comparison on > 0 first - otherwise MAX_RUN_HOURS=0 reads as “deadline is now” and aborts instantly.

No overlap. Before starting, assert no run is already in progress (advisory lock, or a “is the worker already running?” preflight). Two concurrent purges on the same tables double the WAL and contend on locks.

Dry-run mode. A preview pass that reports what would be deleted (counts per table) without deleting, so a misconfigured run fails safe.

Observability. Emit per-batch rows-deleted, both lag readings, WAL-per-batch, and vacuum events to your metrics pipeline. A dashboard of “WAL generated vs replica lag vs slot backlog” during a purge is worth more than any log.

11. The gotchas that cost a weekend

NULL = NULL is NULL, not true. Any anti-join or dedup on a nullable column silently mis-handles NULL rows on re-runs. Dedup on a guaranteed-non-null key.

Chunk only on a NOT NULL key. Slicing work by a nullable column corrupts your “next chunk” boundaries (NULLs sort last and get skipped or mangled). Confirm is_nullable = 'NO' in the catalog, not “it’s always populated in practice.”

Diagnostic queries match themselves. SELECT … FROM pg_stat_activity WHERE query LIKE '%vacuum%' finds itself. Add pid <> pg_backend_pid().

grep -q under set -o pipefail. grep -q exits on first match, SIGPIPEs the upstream producer (exit 141), and pipefail turns the pipeline non-zero - flipping your if ! logic. In shell glue, prefer pure-bash matching to printf … | grep -q. Stochastic and timing-dependent; it’ll pass on your laptop and fail in the container.

Stale planner stats after a big delete. Row-count estimates go stale; the next query may flip from an index scan to a seq scan. ANALYZE the big tables after the purge.

Snapshot drift in restore paths. If a recovery/copy path runs alongside, never SELECT * between two databases that can drift in column set/order - compute the explicit column intersection.

A pre-flight checklist

  • Can I partition-and-DROP instead of deleting? (Section 9 - ask first.)

  • Batched deletes, keyset cursor, per-batch COMMIT (procedure, not function)

  • Big tables have a real PK and REPLICA IDENTITY DEFAULT (not FULL)

[ ] Indexes on every EXISTS/join/chunk key - confirmed via EXPLAIN

  • Resume reads the target (or EXISTS-guards the key list); no static-list re-walk

  • Backpressure on both physical lag (sec) and logical slot backlog (bytes)

  • COMMIT before every pause-loop sleep; deadline escape hatch in each loop

  • Per-table autovacuum tuned for the window; mid-run VACUUM via separate connection

  • max_slot_wal_keep_size set; disk + per-slot retained-bytes alerts armed

[ ] Deadline, no-overlap guard, dry-run, audit table, metrics

Closing

A DELETE is a single line of SQL. A bulk delete on a live system is a distributed-systems problem wearing a SQL costume: you’re modulating the rate at which you feed WAL, dead tuples, and decoded transactions into a web of replicas, slots, and vacuum workers - none of which you control, all of which can take production down if you outrun them. The mental model that makes it tractable: keep the xmin horizon moving (batch + commit), keep WAL recyclable (commit before you pause, don’t write more than you must), and keep the slowest consumer in your control loop (measure both lags, pause without pinning). Everything else is detail. Delete slowly, commit often, watch your slots - and when you can, don’t delete at all; drop a partition.

0 views
Back to Blog

Related posts

Read more »