Module 4·41 min read

How a stalled replication slot pinned OldestXmin for 48 hours, accumulating 800 million dead tuples — and the autovacuum mechanics that let it happen.

Module 4 — Autovacuum: The Process Everyone Misconfigures

What this module covers: Autovacuum is not a background cleanup job you can ignore. It is the mechanism that prevents your tables from bloating to 10x their logical size, your indexes from degrading, your standbys from lagging, and — in the worst case — your entire database from shutting down due to transaction ID wraparound. This module covers how autovacuum actually works, what it does to your data, why its defaults are wrong for most production workloads, and how to tune it from first principles.

The Problem Autovacuum Solves

In Module 2, you learned that MVCC creates dead tuples. Every UPDATE marks the old tuple dead and writes a new one. Every DELETE marks the tuple dead. These dead tuples remain physically present in the heap — they occupy pages, bloat indexes, and slow down sequential scans — until something reclaims them.

That something is VACUUM.

Without VACUUM, a table with a high update rate grows without bound. Not because the number of live rows is growing, but because dead tuples accumulate. A table with 1 million live rows that gets updated 10 times per row will physically contain up to 10 million tuples on disk.

Autovacuum is the background process that runs VACUUM automatically so you don't have to schedule it manually. This sounds simple. The complexity is in the details:

When does autovacuum decide a table needs vacuuming?
How aggressively does it work while it's running?
What exactly does it do during a vacuum pass?
What does it not do that surprises most engineers?
How do you tune it for a specific workload?

What VACUUM Actually Does

Before tuning autovacuum, you need a precise model of what a VACUUM pass does. Most engineers think of it as "cleans up dead tuples." The reality is a multi-step process that touches heap files, indexes, the Free Space Map, and the Visibility Map.

Step 1: Scan the Heap for Dead Tuples

VACUUM reads every page of the heap file sequentially, examining tuple headers. For each tuple, it checks visibility: is this tuple dead to all current and future transactions?

A tuple is dead-to-all if:

Its xmax is set (it has been deleted or updated)
The transaction that set xmax has committed
That transaction's XID is older than the oldest active transaction snapshot (OldestXmin)

Any tuple older than OldestXmin that is deleted cannot be seen by any current or future transaction. It is safe to reclaim.

sql
-- See what OldestXmin currently is
SELECT backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin
LIMIT 1;

-- A long-running transaction keeps OldestXmin old, preventing dead tuple reclamation
-- This is why idle-in-transaction sessions cause bloat

Step 2: Build the Dead Tuple TID List

VACUUM builds an in-memory list of tuple IDs (TIDs — page number + offset within page) for all dead tuples found in step 1. This list is bounded by maintenance_work_mem — if there are more dead tuples than fit in memory, VACUUM processes them in multiple passes.

ini
# How much memory VACUUM can use for the dead tuple list
maintenance_work_mem = 256MB  # default is 64MB — increase for large tables

With the default 64MB, VACUUM can hold approximately 1 million dead tuple TIDs per pass (each TID is 6 bytes). A table with 10 million dead tuples requires ~10 passes, each requiring a full re-scan of the heap. Increasing maintenance_work_mem to 512MB or 1GB for autovacuum workers dramatically reduces the number of heap passes on large tables.

Step 3: Remove Dead TIDs from Indexes

For every dead tuple found, VACUUM must remove its entry from every index on the table. This is often the most expensive part of VACUUM on heavily-indexed tables.

VACUUM scans each index, finds entries pointing to dead TIDs, and removes them. Index pages that become completely empty after cleanup can be reclaimed.

This is why the number of indexes on a table directly affects VACUUM cost. A table with 10 indexes requires VACUUM to do 10 index scans per heap pass.

Step 4: Reclaim Dead Tuple Space in the Heap

With the index entries removed, VACUUM returns to the heap pages and marks the dead tuple space as free. It updates the Free Space Map (FSM) for each page to reflect the newly available space.

Critically: VACUUM does not move live tuples or compact pages. If a page has alternating live and dead tuples scattered throughout it, VACUUM marks the dead tuple slots as free but leaves the live tuples exactly where they are. The page now has holes — free space interspersed with live tuples.

New inserts can use this free space (via the FSM), but the page count of the table does not decrease. This is why VACUUM does not shrink a table's physical file. Only VACUUM FULL (which rewrites the entire table) shrinks it — but VACUUM FULL takes an exclusive lock and is rarely the right tool.

Step 5: Update the Visibility Map

The Visibility Map (VM) is a compact bitmap, one bit per heap page, that records whether all tuples on a page are visible to all transactions. If a page's VM bit is set, index-only scans can skip the heap entirely.

VACUUM sets VM bits for pages where all tuples are now visible to all transactions (all dead tuples have been reclaimed, all live tuples are committed and old enough). This is one of the most important outcomes of a VACUUM pass — without it, index-only scans cannot function correctly.

Step 6: Update pg_class Statistics

VACUUM updates pg_class.relpages and pg_class.reltuples with the current page count and live tuple estimate. The query planner uses these values. Stale statistics here cause wrong cardinality estimates and bad plans.

What VACUUM Does NOT Do

Does not shrink the physical file — reclaimed space is reused by future inserts, not returned to the OS
Does not remove all dead tuples — tuples newer than OldestXmin are not dead yet; they may be visible to open transactions
Does not reindex — indexes accumulate bloat over time; REINDEX or VACUUM (INDEX_CLEANUP) is separate
Does not update column statistics — that is ANALYZE's job; VACUUM ANALYZE runs both

Autovacuum Architecture

The Autovacuum Launcher

A single autovacuum launcher process runs continuously. Its job is to monitor tables and decide when to trigger a vacuum worker. It checks each database in round-robin fashion, looking at pg_stat_user_tables to find tables that exceed the vacuum or analyze threshold.

Autovacuum Workers

When the launcher decides a table needs vacuuming, it spawns an autovacuum worker process. Workers are capped at autovacuum_max_workers (default: 3).

This is the first misconfiguration most systems have: 3 workers is almost always too few for production. If you have 50 tables all needing vacuum simultaneously, only 3 can be processed concurrently. The others wait.

ini
# Increase for systems with many tables or high write throughput
autovacuum_max_workers = 6  # or higher for very active systems

Each worker processes one table at a time. It acquires a lightweight lock (no exclusive lock — normal reads and writes continue) and performs the VACUUM steps described above.

When Autovacuum Triggers

Autovacuum triggers a vacuum of a table when the estimated number of dead tuples exceeds:

vacuum_threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

With defaults:

ini
autovacuum_vacuum_threshold = 50        # minimum dead tuples before considering vacuum
autovacuum_vacuum_scale_factor = 0.2    # 20% of live tuples

For a table with 1,000,000 live rows:

vacuum_threshold = 50 + 0.2 * 1,000,000 = 200,050 dead tuples

Autovacuum waits until 200,050 dead tuples accumulate before vacuuming. On a table processing 1,000 updates/second, that is 200 seconds of dead tuple accumulation before cleanup begins.

For large tables, the default scale factor is catastrophically wrong.

A table with 100 million rows requires 20 million dead tuples before autovacuum triggers. At 10,000 updates/second, that is 33 minutes of bloat accumulation per vacuum cycle. The table can grow to many times its logical size before vacuum catches up.

The fix: reduce autovacuum_vacuum_scale_factor for large, write-heavy tables. We'll cover per-table tuning shortly.

When Autovacuum Triggers ANALYZE

Autovacuum also triggers ANALYZE when:

analyze_threshold = autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples

Default: 50 + 0.1 * reltuples. For a 1M-row table: 100,050 row changes before statistics are updated.

Autovacuum Cost-Based Throttling

Here is where autovacuum's second major misconfiguration lives.

To avoid overwhelming the I/O subsystem, autovacuum workers throttle themselves using a cost-based delay mechanism:

ini
autovacuum_vacuum_cost_delay = 2ms     # how long to sleep after hitting cost limit
autovacuum_vacuum_cost_limit = 200     # cost units before sleeping

# Cost of each I/O operation (in cost units):
vacuum_cost_page_hit   = 1    # page found in shared_buffers
vacuum_cost_page_miss  = 10   # page read from disk
vacuum_cost_page_dirty = 20   # page written (dirtied by vacuum)

How throttling works: The autovacuum worker accumulates cost as it reads and writes pages. When accumulated cost hits autovacuum_vacuum_cost_limit (200 by default), the worker sleeps for autovacuum_vacuum_cost_delay (2ms by default), then resets its cost counter and continues.

The throughput math:

In 2ms of sleep per 200 cost units:

If all reads are cache misses (vacuum_cost_page_miss = 10): 20 pages per cycle
At 8KB per page: 160KB per 2ms cycle
Maximum throughput: 80MB/s... but this assumes zero actual I/O time

In practice, the disk reads themselves take time. With a cost limit of 200 and page_miss = 10, the worker reads ~20 pages before sleeping. On a cold cache with SSD reads at ~100μs each, 20 pages takes ~2ms to read — then another 2ms of sleep. Effective throughput: ~40MB/s of heap scan.

Why the defaults are too conservative:

On modern NVMe storage, 40MB/s vacuum throughput is extremely slow. A table with 10GB of dead tuple bloat will take 250 seconds (over 4 minutes) of pure vacuum I/O time to clean — and that's before the multiple passes for large dead tuple lists.

For systems with fast storage:

ini
autovacuum_vacuum_cost_delay = 0       # no sleep between cost cycles (aggressive)
# or
autovacuum_vacuum_cost_delay = 1ms     # reduce sleep by 50%
autovacuum_vacuum_cost_limit = 800     # allow 4x more work per cycle

Setting cost_delay = 0 makes autovacuum run at full I/O speed — appropriate for NVMe systems where vacuum I/O doesn't compete with query I/O significantly.

Transaction ID Wraparound: The Shutdown You Cannot Avoid

This is the most serious consequence of misconfigured autovacuum, and it deserves its own section.

How XID Wraparound Works

As covered in Module 2, transaction IDs (XIDs) are 32-bit integers. The database can have at most ~2 billion XIDs in flight before it wraps around. XID wraparound is not theoretical — every database will face it if it runs long enough.

Postgres's solution: freeze old tuples. A frozen tuple's xmin is replaced with a special FrozenTransactionId (XID 2) that is always considered visible to all transactions. Frozen tuples are invisible to XID comparison logic — they can never appear to be "in the future" after wraparound.

The Wraparound Safety Threshold

Postgres starts issuing warnings at vacuum_warn_age XIDs from wraparound and begins refusing new transactions at vacuum_freeze_min_age XIDs from wraparound:

ini
# Freeze tuples older than this many transactions
vacuum_freeze_min_age = 50000000       # 50 million transactions

# Issue warning when oldest unfrozen XID is this close to wraparound
vacuum_freeze_table_age = 150000000    # 150 million transactions

# PANIC and refuse connections at this threshold
autovacuum_freeze_max_age = 200000000  # 200 million transactions

When the oldest unfrozen XID in any table is within autovacuum_freeze_max_age of wraparound, Postgres will aggressively vacuum that table — ignoring cost throttling — to freeze old tuples. This is the autovacuum wraparound prevention vacuum.

If autovacuum cannot keep up with freezing and your database reaches the wraparound limit:

WARNING: database "yourdb" must be vacuumed within 1000000 transactions
ERROR: database is not accepting commands to avoid wraparound data loss
HINT: Stop the postmaster and vacuum that database in single-user mode.

Your database stops accepting writes. This is not a theoretical risk — it has happened to production databases at large companies. The fix requires downtime.

Monitoring Wraparound Risk

sql
-- Find tables most at risk of XID wraparound
SELECT
  schemaname,
  relname,
  age(relfrozenxid) AS xid_age,
  pg_size_pretty(pg_total_relation_size(oid)) AS table_size,
  ROUND(100.0 * age(relfrozenxid) / 2000000000, 2) AS pct_toward_wraparound
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;

-- Alert threshold: xid_age > 1,500,000,000 (75% of 2B)
-- Action threshold: xid_age > 1,800,000,000 (90% of 2B)

sql
-- Check autovacuum freeze progress on a specific table
SELECT
  last_autovacuum,
  last_autoanalyze,
  n_dead_tup,
  n_live_tup,
  relname
FROM pg_stat_user_tables
WHERE relname = 'transactions'
ORDER BY n_dead_tup DESC;

Table Bloat: Diagnosing and Responding

What Bloat Looks Like

Bloat is the difference between the physical size of a table/index and the size that would be needed if all dead tuples and free space were removed.

sql
-- Quick bloat estimate using pg_stats
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_size_pretty(
    pg_total_relation_size(schemaname||'.'||tablename) -
    (pg_relation_size(schemaname||'.'||tablename) *
     (CASE WHEN avg_width > 0
      THEN ceil(n_live_tup * (avg_width + 24) / 8192.0) / (pg_relation_size(schemaname||'.'||tablename) / 8192.0)
      ELSE 1 END))::bigint
  ) AS estimated_bloat
FROM pg_stats
JOIN pg_class ON relname = tablename
WHERE schemaname = 'public'
  AND pg_total_relation_size(schemaname||'.'||tablename) > 10000000
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

For production use, the pgstattuple extension gives precise bloat measurements:

sql
CREATE EXTENSION pgstattuple;

-- Precise table bloat (performs full scan — use on replicas or off-peak)
SELECT
  table_len,
  tuple_count,
  tuple_len,
  tuple_percent,
  dead_tuple_count,
  dead_tuple_len,
  dead_tuple_percent,
  free_space,
  free_percent
FROM pgstattuple('transactions');

-- dead_tuple_percent > 20% = vacuum is not keeping up
-- free_percent > 30% = previously bloated, now partially reclaimed

Index Bloat

Index bloat is separate from table bloat and often worse. B-tree indexes accumulate dead entries that are not reclaimed until VACUUM processes them. A heavily-updated table with an index on the updated column can have an index 5–10x larger than its logical content.

sql
-- Check index bloat using pgstattuple
SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  pg_size_pretty(pgsi.leaf_fragmentation::bigint * 8192) AS fragmentation_estimate,
  pgsi.avg_leaf_density,
  pgsi.leaf_pages_removed
FROM pg_stat_user_indexes pgsi
JOIN pg_index ON indexrelid = pgsi.indexrelid
WHERE pg_relation_size(indexrelid) > 10000000
ORDER BY pg_relation_size(indexrelid) DESC;

-- Or use pgstatindex for B-tree indexes specifically:
SELECT * FROM pgstatindex('idx_transactions_block_height');
-- avg_leaf_density < 50% = significant bloat
-- leaf_fragmentation > 30% = consider REINDEX

Responding to Bloat

Option 1: Manual VACUUM (immediate, non-locking)

sql
-- Run VACUUM on a specific table with progress reporting
VACUUM (VERBOSE, ANALYZE) transactions;

-- Watch progress in another session
SELECT
  phase,
  heap_blks_scanned,
  heap_blks_total,
  index_vacuum_count,
  num_dead_tuples,
  pg_size_pretty(heap_blks_scanned * 8192) AS scanned
FROM pg_stat_progress_vacuum;

Option 2: VACUUM FULL (shrinks file, requires exclusive lock)

sql
-- Blocks ALL reads and writes for the duration
-- Only use during a maintenance window on non-critical tables
VACUUM FULL transactions;

VACUUM FULL rewrites the entire table into a new file. It acquires AccessExclusiveLock — no reads, no writes during the operation. On a 100GB table it takes hours. Use sparingly and always during a maintenance window.

Option 3: pg_repack (online, no exclusive lock)

pg_repack is an extension that rebuilds the table online without extended locking. It copies the table to a new structure, replays changes that happened during the copy, and atomically swaps the old and new tables. The exclusive lock is held only for milliseconds at the end.

bash
pg_repack -h localhost -d mydb -t transactions

This is the production-correct answer for tables that need physical compaction without a maintenance window.

Per-Table Autovacuum Tuning

The global autovacuum settings are a baseline. Individual tables that have very different workloads should have their autovacuum parameters overridden using storage parameters:

sql
-- High-write table: vacuum more aggressively
ALTER TABLE transactions SET (
  autovacuum_vacuum_scale_factor = 0.01,    -- vacuum when 1% dead (not 20%)
  autovacuum_vacuum_threshold = 1000,        -- minimum 1000 dead tuples
  autovacuum_analyze_scale_factor = 0.005,  -- analyze when 0.5% changed
  autovacuum_vacuum_cost_delay = 0,          -- no throttling for this table
  autovacuum_vacuum_cost_limit = 1000        -- higher cost limit
);

-- Read-heavy table: reduce autovacuum frequency to minimize I/O interference
ALTER TABLE reference_data SET (
  autovacuum_vacuum_scale_factor = 0.5,     -- only vacuum when 50% dead
  autovacuum_vacuum_cost_delay = 10          -- be gentle with I/O
);

-- Table with extreme write rate (e.g., event ingestion):
ALTER TABLE raw_events SET (
  autovacuum_vacuum_scale_factor = 0.0,     -- always use threshold only
  autovacuum_vacuum_threshold = 10000,       -- vacuum every 10k dead tuples
  autovacuum_vacuum_insert_scale_factor = 0.0, -- also vacuum after inserts
  autovacuum_vacuum_insert_threshold = 100000  -- after 100k inserts
);

The Insert-Only Table Problem

Tables that are only inserted into — never updated or deleted — still need vacuuming. Why? Because inserts create heap pages that may be partially filled, and the Visibility Map bits for those pages are not set until VACUUM processes them. Without VM bits set, index-only scans cannot be used on that table.

Postgres 13+ added autovacuum triggering for insert-heavy tables:

ini
autovacuum_vacuum_insert_scale_factor = 0.2   # trigger after 20% of reltuples inserted
autovacuum_vacuum_insert_threshold = 1000      # minimum 1000 inserts

For append-only tables (like time-series or event tables), set a low insert threshold so the VM gets updated regularly and index-only scans remain available.

Autovacuum and Replication Lag

A subtle but important interaction: autovacuum on the primary can cause replication lag on standbys.

When autovacuum processes a large table, it generates significant WAL:

Full page writes for every page it modifies
Index cleanup records
Visibility map updates
Heap page reclaim records

This WAL burst must be applied by the standby's recovery process. If the primary generates WAL faster than the standby can apply it, replication lag spikes during autovacuum runs.

sql
-- Identify which autovacuum is causing lag
SELECT
  pid,
  wait_event_type,
  wait_event,
  query,
  state
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';

-- On the standby, check current apply lag
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag;

Mitigation strategies:

Throttle autovacuum during peak hours using autovacuum_vacuum_cost_delay — set it higher during business hours and lower overnight via a scheduled ALTER TABLE.
Increase checkpoint_completion_target to 0.9 — this spreads the I/O from checkpoint more evenly, reducing WAL burst.
Use vacuum_cost_delay for manual VACUUM during off-peak to pre-empt autovacuum doing it during peak hours.

The Autovacuum Wraparound Prevention Vacuum

When a table's oldest unfrozen XID approaches autovacuum_freeze_max_age, autovacuum enters anti-wraparound mode for that table. In this mode:

Cost throttling is ignored — the worker runs at full speed
The vacuum is logged as autovacuum: VACUUM (to prevent wraparound)
The vacuum processes the entire table, not just the pages with dead tuples

This means a full table scan of your largest table at full I/O speed, triggered automatically, with no throttling. On a 1TB table, this can saturate I/O for hours.

The way to prevent surprise anti-wraparound vacuums: keep autovacuum_freeze_max_age at 200M (default) but ensure regular autovacuums are keeping XID ages low. Monitor age(relfrozenxid) and alert before it reaches 150M.

sql
-- Tables that have had wraparound prevention vacuums recently
-- (look in postgresql.log for 'to prevent wraparound' messages)

-- Proactive: find tables where regular autovacuum is not freezing fast enough
SELECT
  relname,
  age(relfrozenxid) AS xid_age,
  last_autovacuum,
  last_vacuum
FROM pg_stat_user_tables
JOIN pg_class ON relname = relname
WHERE age(relfrozenxid) > 100000000  -- 100M XIDs old
ORDER BY age(relfrozenxid) DESC;

Production Monitoring Queries

The Autovacuum Health Dashboard

sql
-- Tables that need vacuum most urgently
SELECT
  schemaname,
  relname,
  n_dead_tup,
  n_live_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum,
  last_autoanalyze,
  autovacuum_count,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS total_size
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

sql
-- Currently running autovacuum workers
SELECT
  pid,
  now() - xact_start AS duration,
  query AS autovacuum_target,
  state
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
ORDER BY xact_start;

sql
-- Autovacuum effectiveness: tables where autovacuum is not keeping up
SELECT
  relname,
  n_dead_tup,
  last_autovacuum,
  EXTRACT(EPOCH FROM (now() - last_autovacuum)) / 3600 AS hours_since_vacuum,
  autovacuum_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 100000
  AND (last_autovacuum IS NULL OR last_autovacuum < now() - INTERVAL '6 hours')
ORDER BY n_dead_tup DESC;

sql
-- XID age health check for all tables
SELECT
  relname,
  age(relfrozenxid) AS xid_age,
  CASE
    WHEN age(relfrozenxid) > 1800000000 THEN 'CRITICAL'
    WHEN age(relfrozenxid) > 1500000000 THEN 'WARNING'
    WHEN age(relfrozenxid) > 1000000000 THEN 'WATCH'
    ELSE 'OK'
  END AS status
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;

Production Tuning Reference

ini
# Worker count — increase for busy systems
autovacuum_max_workers = 6

# Trigger sensitivity — critical to reduce for large tables
autovacuum_vacuum_scale_factor = 0.05      # 5% instead of 20%
autovacuum_vacuum_threshold = 1000
autovacuum_analyze_scale_factor = 0.02     # 2% instead of 10%

# Cost throttling — relax for fast storage
autovacuum_vacuum_cost_delay = 2ms         # keep at 2ms unless you have NVMe
autovacuum_vacuum_cost_limit = 400         # double the default

# Memory for vacuum
autovacuum_work_mem = 256MB               # separate from maintenance_work_mem

# Freezing — keep defaults unless you have >500M XID/day throughput
autovacuum_freeze_max_age = 200000000
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000

# Naptime between launcher checks
autovacuum_naptime = 1min                  # default; reduce to 30s for high-write systems

The Production Incident: Autovacuum Lagging Behind Replication Slot

Context: A blockchain indexer writing ~2,000 rows/second to a raw_blocks table.

What happened:

A logical replication slot was created for a CDC pipeline consuming block data. The CDC consumer was taken offline for 2 days during an infrastructure migration. Nobody dropped the slot or monitored it.

During those 2 days:

The primary continued writing at 2,000 rows/second: ~345 million inserts
The replication slot retained all WAL since its restart_lsn
pg_wal/ grew to 200GB before alerts fired
The OldestXmin — the oldest XID visible to any consumer — was pinned by the slot's catalog_xmin

The pinned OldestXmin meant autovacuum could not reclaim dead tuples older than that XID. Even though the blocks table was mostly insert-only, the raw_blocks_status summary table (updated constantly) accumulated dead tuples that could not be reclaimed.

After 48 hours: the summary table had 800 million dead tuples. Autovacuum was running continuously but making no progress — every tuple it tried to reclaim was newer than the pinned OldestXmin.

Resolution:

sql
-- 1. Identify the offending slot
SELECT slot_name, catalog_xmin, pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal
FROM pg_replication_slots;

-- 2. Drop the slot (after confirming the consumer is offline)
SELECT pg_drop_replication_slot('cdc_pipeline_slot');

-- 3. OldestXmin immediately advances; manual vacuum to clean up
VACUUM ANALYZE raw_blocks_status;

-- 4. Monitor progress
SELECT phase, heap_blks_scanned, heap_blks_total, num_dead_tuples
FROM pg_stat_progress_vacuum;

The slot drop allowed OldestXmin to advance. Autovacuum immediately reclaimed 800 million dead tuples over 6 hours. pg_wal/ was cleaned up. Table size returned to normal.

The lessons:

Always monitor pg_replication_slots for inactive slots
Set max_slot_wal_keep_size to cap WAL retention per slot:

ini
max_slot_wal_keep_size = 10GB   # slot cannot retain more than 10GB of WAL

Alert on pg_replication_slots where active = false for more than 1 hour
catalog_xmin pinning is separate from restart_lsn — watch both

Summary

Concept	Key Takeaway
What VACUUM does	Scans heap, removes dead tuple entries from indexes, reclaims heap space, updates FSM and VM
What VACUUM doesn't do	Shrink the file, compact live tuples, update column statistics
Default scale factor	20% is too high for any table over 10M rows — use 1–5% instead
Cost throttling	Default 2ms/200 limits is conservative for NVMe; increase limit or reduce delay
Workers	Default 3 is too few for busy systems; use 6+
XID wraparound	Monitor `age(relfrozenxid)` — alert at 150M, action at 180M
Index bloat	Separate from table bloat; use `pgstatindex()` to measure
Replication slot pinning	Inactive slots pin `OldestXmin` and block dead tuple reclamation
Anti-wraparound vacuum	Runs at full speed, ignoring cost throttle — prevent it with proactive monitoring
Per-table tuning	Use `ALTER TABLE ... SET (autovacuum_*)` to override globals for specific tables

Autovacuum is the custodian that keeps MVCC from making your database unusable. Module 5 goes into the data structures that make reads fast — indexes — including how they are built, how they degrade, and when the wrong index hurts more than no index at all.

Next: Module 5 — Indexes: B-Tree Internals, GIN, GiST, and When Each One Hurts You →

pg_repack — Online Table Compaction Without Locks

VACUUM marks dead tuples as reusable. It does NOT return disk space to the operating system. The table file on disk stays the same size. After bulk-deleting 50% of an 80GB table, the table file is still 80GB. The space is marked free internally but the file system reports no change.

This matters operationally: your storage bill doesn't shrink. More critically, sequential scans of the table still read the entire 80GB file (including empty pages), slowing full-table queries. A query plan that expects to read 40GB of live data is reading 80GB of file — half of every I/O is wasted on pages that hold nothing.

Why VACUUM FULL Is Not the Answer

VACUUM FULL rewrites the entire table, reclaiming disk space. But it acquires AccessExclusiveLock — blocking ALL reads and writes — for the entire rewrite duration. On an 80GB table, that's 30–90 minutes of total table unavailability. Not acceptable for production.

The table is essentially offline for the duration. If your application retries, it queues connections. If it doesn't retry, users see errors. Either way: not a tool for production hours.

pg_repack — What It Does

pg_repack is a PostgreSQL extension that performs an online table repack: it builds a compacted copy of the table in the background while the original table continues to serve reads and writes, then atomically swaps the new table in using a brief AccessExclusiveLock that lasts approximately 1 second regardless of table size.

bash
# Install (version must match your PostgreSQL major version)
apt-get install postgresql-16-repack

# Repack a specific table
pg_repack --host=localhost --username=postgres --dbname=myapp --table=orders

# Repack entire database
pg_repack --host=localhost --username=postgres --dbname=myapp

# Dry run to estimate space savings without making any changes
pg_repack --dry-run --dbname=myapp --table=orders

The process under the hood:

Creates a log table to capture all ongoing changes to the target table
Builds a new compacted table from the original (full table scan in the background, no locks)
Replays the log table's captured changes onto the new table
Acquires AccessExclusiveLock briefly (~1 second) to swap the table files
Drops the original bloated table and the log table

The 1-second exclusive lock is for the swap only — it is not proportional to table size. Your application sees a brief hiccup, not a multi-minute outage.

When to Run pg_repack

Signals that a table needs repacking:

sql
-- Check table bloat: compare pg_relation_size to estimated live data size
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS heap_size,
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_tables
JOIN pg_stat_user_tables USING (schemaname, tablename)
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

A table with dead_pct > 40% that has not shrunk despite autovacuum running is a candidate for repacking. Autovacuum reclaims the space internally but cannot shrink the file — only pg_repack or VACUUM FULL can do that.

Also check: a table whose pg_relation_size is 3x or more what you would expect given its row count. For example, an orders table with 10 million rows should be roughly 5–15GB depending on row width. If it's 60GB, bloat is the likely explanation.

sql
-- Quick check: tables where last vacuum ran but file hasn't shrunk
-- (high n_dead_tup relative to n_live_tup despite recent autovacuum)
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  last_autovacuum,
  pg_size_pretty(pg_relation_size(schemaname||'.'||relname)) AS heap_size
FROM pg_stat_user_tables
JOIN pg_class ON relname = pg_class.relname
WHERE n_dead_tup > 1000000
  AND last_autovacuum > now() - INTERVAL '24 hours'
ORDER BY n_dead_tup DESC;

Operational Notes

Disk space requirement: pg_repack needs free disk space equal to the size of the table being repacked (it builds a full copy). Before repacking an 80GB table, verify you have at least 80GB free on the data volume. df -h $(pg_lsclusters -h | awk '{print $6}') on Linux shows available space on the PostgreSQL data directory.

Resource usage: pg_repack runs a full table scan and index rebuild — it is I/O intensive. Use --wait-timeout and --no-kill-backend to avoid disrupting existing connections:

bash
# Conservative: wait up to 60s for conflicting locks, don't kill existing backends
pg_repack --dbname=myapp --table=orders --wait-timeout=60 --no-kill-backend

# More aggressive: lower timeout, allow killing backends if necessary
pg_repack --dbname=myapp --table=orders --wait-timeout=10

Monitor progress: pg_repack logs progress to stderr. On large tables (>50GB), expect 30–90 minutes. Run it in a tmux session or with output redirected to a log file.

Index rebuild: pg_repack also rebuilds all indexes as part of the repack. This means your indexes come out of the repack defragmented and compact — an additional benefit over VACUUM FULL which also rebuilds indexes, but with the exclusive lock held throughout.

Scheduling: Run pg_repack during off-peak hours. The table is fully available during the repack, but the I/O load is real and will compete with query I/O. On a system already I/O constrained, running pg_repack during peak hours will slow down queries even though the table remains accessible.

PreviousModule 3: Write-Ahead Logging: Durability, Replication, and the Price of Every Write Next Module 5: Indexes: B-Tree Internals, GIN, GiST, and When Each One Hurts You