Module 12·40 min read

The difference between junior and senior is not SQL knowledge — it is having the right runbooks before the incident.

Module 12 — Production Operations: Monitoring, Migration, and the Runbooks That Matter

What this module covers: The difference between a junior and a senior database engineer is not SQL knowledge — it is having the right runbooks before the incident, the right monitoring before the alert, and the right migration plan before the deployment. This module covers the operational stack: what to measure and how, connection pooling with PgBouncer, backup and restore, zero-downtime migration patterns, and the five runbooks that every team running Postgres in production needs to have written down before they need them.

The Monitoring Stack

You cannot operate what you cannot observe. Postgres exposes an exceptional amount of internal state through pg_stat_* views. The challenge is not access — it is knowing which metrics to collect, what thresholds to alert on, and what to do when they fire.

The Essential Views

sql
-- 1. Active connections and what they are doing
SELECT
  pid,
  usename,
  application_name,
  client_addr,
  state,                           -- 'active', 'idle', 'idle in transaction', 'disabled'
  wait_event_type,                 -- 'Lock', 'LWLock', 'IO', 'Client', etc.
  wait_event,
  ROUND(EXTRACT(EPOCH FROM (now() - state_change))::numeric, 1) AS state_seconds,
  LEFT(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY state_seconds DESC;

idle in transaction connections are the most dangerous. They hold open MVCC snapshots, pin OldestXmin, and block autovacuum from reclaiming dead tuples (Module 2 and 4). Any connection idle in transaction for more than 30 seconds deserves investigation.

sql
-- 2. Table-level access statistics
SELECT
  relname,
  seq_scan,
  seq_tup_read,
  idx_scan,
  idx_tup_fetch,
  n_tup_ins,
  n_tup_upd,
  n_tup_del,
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

sql
-- 3. Index usage
SELECT
  relname AS table,
  indexrelname AS index,
  idx_scan,
  idx_tup_read,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
-- idx_scan = 0 on large indexes = candidate for removal

sql
-- 4. Cache hit ratio (should be > 99% for OLTP)
SELECT
  sum(heap_blks_hit) AS heap_hits,
  sum(heap_blks_read) AS heap_reads,
  ROUND(
    100.0 * sum(heap_blks_hit) /
    NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2
  ) AS cache_hit_pct
FROM pg_statio_user_tables;

-- Per-table cache hit
SELECT
  relname,
  heap_blks_hit,
  heap_blks_read,
  ROUND(100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
FROM pg_statio_user_tables
WHERE heap_blks_hit + heap_blks_read > 1000
ORDER BY heap_blks_read DESC;

sql
-- 5. Lock waits (queries waiting for locks)
SELECT
  blocked.pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  blocked.wait_event,
  ROUND(EXTRACT(EPOCH FROM (now() - blocked.state_change))::numeric, 1) AS waiting_seconds
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

sql
-- 6. Long-running queries
SELECT
  pid,
  usename,
  ROUND(EXTRACT(EPOCH FROM (now() - query_start))::numeric, 1) AS query_seconds,
  state,
  LEFT(query, 120) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - INTERVAL '30 seconds'
ORDER BY query_seconds DESC;

`pg_stat_statements`: Query Performance Over Time

sql
-- Top 15 queries by total time (install extension first)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT
  LEFT(query, 100)                               AS query,
  calls,
  ROUND(total_exec_time::numeric / 1000, 2)      AS total_sec,
  ROUND(mean_exec_time::numeric, 2)              AS mean_ms,
  ROUND(stddev_exec_time::numeric, 2)            AS stddev_ms,
  ROUND(100.0 * total_exec_time /
    sum(total_exec_time) OVER (), 2)             AS pct_total,
  rows / NULLIF(calls, 0)                        AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 15;

sql
-- Queries with highest I/O cost (cache miss heavy)
SELECT
  LEFT(query, 100) AS query,
  calls,
  shared_blks_read AS disk_reads,
  shared_blks_hit AS cache_hits,
  ROUND(100.0 * shared_blks_read /
    NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS miss_pct,
  ROUND(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 10000
ORDER BY shared_blks_read DESC
LIMIT 10;

Alerting Thresholds

Metric	Warning	Critical	Action
Connections used (% of `max_connections`)	> 70%	> 85%	Check for connection leaks; increase PgBouncer pool
`idle in transaction` connections	> 5	> 20	Kill offending connections; fix application code
Replication lag (bytes)	> 100MB	> 500MB	Investigate standby I/O; reduce write load
Dead tuple % on any table	> 20%	> 40%	Run manual VACUUM; tune autovacuum for that table
XID age on any table	> 1B	> 1.5B	Run VACUUM FREEZE immediately
Cache hit ratio	< 99%	< 95%	Increase `shared_buffers`; optimize query working set
`pg_wal` directory size	> 60% of partition	> 80%	Check inactive replication slots; increase disk
Long-running queries	> 30s	> 5min	Identify and kill; add timeout: `statement_timeout`
Lock wait time	> 5s	> 30s	Find blocking query; investigate lock contention

ini
# Automatic statement timeout — kills runaway queries
statement_timeout = '300s'       # 5 minutes max for any query
lock_timeout = '30s'             # fail fast if can't acquire lock
idle_in_transaction_session_timeout = '60s'  # kill idle-in-transaction after 60s

idle_in_transaction_session_timeout is one of the most important production settings that most teams don't have set. It automatically terminates connections that have been idle-in-transaction for too long — eliminating the entire class of "someone left a transaction open in their REPL and the table hasn't vacuumed in 6 hours" incidents.

Connection Pooling: PgBouncer

Every Postgres connection is a full OS process (Module 0). Above a few hundred connections, fork overhead and memory pressure become significant. PgBouncer is the standard connection pooler — a lightweight proxy that multiplexes many application connections onto a smaller number of Postgres connections.

PgBouncer Pooling Modes

Session mode:

One Postgres connection is assigned to an application connection for the entire session
Prepared statements, advisory locks, and SET parameters work correctly
Pool size = max simultaneous sessions ≈ similar to no pooler for connection count
Use when: you need prepared statements or session-level state

Transaction mode:

A Postgres connection is held only during a transaction; returned to the pool after COMMIT/ROLLBACK
One Postgres connection can serve many sequential application transactions
Breaks: prepared statements (not persisted across transactions), advisory locks, SET parameters, LISTEN/NOTIFY
Use when: high-throughput OLTP with short transactions and no prepared statements
Most PgBouncer deployments use this mode

Statement mode:

A Postgres connection is held only for a single statement
Breaks: multi-statement transactions (each statement gets its own connection → no transaction atomicity)
Rarely used in practice

ini
# pgbouncer.ini
[databases]
mydb = host=primary port=5432 dbname=indexer

[pgbouncer]
pool_mode = transaction
max_client_conn = 5000       # max application connections to PgBouncer
default_pool_size = 50       # Postgres connections per database per user
min_pool_size = 10
reserve_pool_size = 5        # extra connections for burst
reserve_pool_timeout = 3     # seconds before using reserve pool

server_idle_timeout = 600    # close idle Postgres connections after 10 min
server_lifetime = 3600       # recycle Postgres connections after 1 hour
client_idle_timeout = 0      # don't close idle application connections (app handles this)

# Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

Sizing the Pool

The optimal number of Postgres connections is not "as many as possible" — it is the number that saturates your hardware without thrashing.

A well-known formula (from PgBouncer's own documentation):

optimal_pool_size ≈ (num_cores × 2) + num_effective_spindles

For a 16-core server with NVMe (no spindles):

optimal ≈ 16 × 2 + 1 = 33 connections

More connections than this means CPU context switching and memory pressure outweigh the benefit of additional concurrency. PgBouncer's default_pool_size = 30–50 for a modern server is a reasonable starting point.

sql
-- Monitor PgBouncer pool utilization (from PgBouncer admin console)
SHOW POOLS;
-- cl_active: clients currently using a server connection
-- cl_waiting: clients waiting for a connection (this should be near 0)
-- sv_active: server connections currently in use
-- sv_idle: server connections available in the pool

cl_waiting > 0 consistently means your pool is undersized for the workload. Either increase default_pool_size (requires more Postgres connections) or reduce application connection count.

Backup Strategies

`pg_dump`: Logical Backup

bash
# Full database dump (SQL format)
pg_dump -h localhost -U postgres -d indexer \
  --format=custom \        # custom binary format, fastest restore
  --compress=9 \           # maximum compression
  --file=indexer_backup_$(date +%Y%m%d).dump

# Restore
pg_restore -h localhost -U postgres -d indexer_restored \
  --jobs=8 \               # parallel restore with 8 workers
  indexer_backup_20260517.dump

# Table-selective dump
pg_dump -h localhost -U postgres -d indexer \
  --table=transactions \
  --format=custom \
  --file=transactions_backup.dump

Advantages: portable across Postgres versions, table-selective, works through logical replication boundaries.
Disadvantages: slow for large databases (must serialize every row), creates significant I/O load on the source, cannot be used for PITR.

`pg_basebackup`: Physical Backup

bash
# Full cluster backup (binary copy of $PGDATA)
pg_basebackup \
  -h localhost \
  -U replicator \
  -D /backup/base_$(date +%Y%m%d) \
  --format=tar \
  --compress=9 \
  --checkpoint=fast \      # force immediate checkpoint before backup starts
  --wal-method=stream \    # stream WAL during backup (ensures backup is consistent)
  --progress \
  --verbose

# Estimated size before running:
pg_basebackup --estimate-size -h localhost -U replicator -D /dev/null

Advantages: fast (parallel I/O), complete cluster copy, can be used as base for PITR, can be used directly as a standby.
Disadvantages: not portable across major versions, full cluster only (not table-selective), requires a replication connection.

pgBackRest: Production Backup Tool

For production systems, pgBackRest is the standard. It handles:

Incremental and differential backups (only changed blocks since last backup)
Parallel backup and restore
WAL archiving integrated with backup
Backup verification
Remote backup to S3, Azure, GCS

ini
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/backup/pgbackrest
repo1-retention-full=2       # keep last 2 full backups
repo1-retention-diff=7       # keep 7 differential backups
compress-type=lz4
compress-level=6

[global:archive]
archive-async=y
archive-get-queue-max=128mb

[indexer]
pg1-path=/var/lib/postgresql/data
pg1-host=primary
pg1-user=postgres

bash
# Full backup
pgbackrest --stanza=indexer backup --type=full

# Incremental backup (default)
pgbackrest --stanza=indexer backup

# Restore to latest
pgbackrest --stanza=indexer restore

# PITR to specific time
pgbackrest --stanza=indexer restore \
  --target="2026-05-17 14:30:00" \
  --target-action=promote

Backup Testing

A backup you have never tested is not a backup — it is a false sense of security. Schedule quarterly restore drills:

bash
# Monthly: restore to a test instance and verify
pgbackrest --stanza=indexer restore --target-action=promote \
  --pg1-path=/tmp/restore_test

# Verify row counts match production
psql -d restored_indexer -c "SELECT count(*) FROM transactions;"
# Compare to: SELECT count(*) FROM transactions; on production

# Verify PITR capability
pgbackrest --stanza=indexer restore \
  --target="2026-05-15 00:00:00" \
  --target-action=pause \
  --pg1-path=/tmp/pitr_test

Zero-Downtime Migration Checklist

Every schema change on a live production database must be evaluated against this checklist.

Operations That Lock (Avoid Without a Plan)

sql
-- These acquire AccessExclusiveLock — blocks ALL reads and writes:
ALTER TABLE transactions ADD COLUMN x TEXT NOT NULL;  -- PG < 11 only
ALTER TABLE transactions ALTER COLUMN amount TYPE NUMERIC(40, 8);
ALTER TABLE transactions DROP COLUMN payload;         -- instant, but lock acquired
CREATE INDEX ON transactions (block_height);          -- use CONCURRENTLY instead
VACUUM FULL transactions;
CLUSTER transactions USING idx_transactions_height;

Safe Operations (No Significant Lock)

sql
-- These are safe on live tables:
ALTER TABLE transactions ADD COLUMN notes TEXT;                    -- nullable, instant
ALTER TABLE transactions ADD COLUMN version INT NOT NULL DEFAULT 1; -- PG 11+, instant
ALTER TABLE transactions ALTER COLUMN notes SET DEFAULT 'none';    -- instant
ALTER TABLE transactions ADD CONSTRAINT amount_positive
  CHECK (amount >= 0) NOT VALID;                                   -- instant (skip validation)
ALTER TABLE transactions VALIDATE CONSTRAINT amount_positive;      -- ShareUpdateExclusiveLock
CREATE INDEX CONCURRENTLY idx_transactions_sender_new
  ON transactions (sender);                                        -- no blocking
DROP INDEX CONCURRENTLY idx_transactions_sender_old;               -- no blocking

The Migration Sequence

Adding a NOT NULL column with a default (PG 10 and earlier):

sql
-- Step 1: add nullable (instant)
ALTER TABLE transactions ADD COLUMN region TEXT;

-- Step 2: backfill in batches (avoid one giant UPDATE that locks)
DO $$
DECLARE
  batch_size INT := 10000;
  last_id BIGINT := 0;
  max_id BIGINT;
BEGIN
  SELECT max(id) INTO max_id FROM transactions;
  WHILE last_id < max_id LOOP
    UPDATE transactions
    SET region = 'us-east'
    WHERE id > last_id AND id <= last_id + batch_size
      AND region IS NULL;
    last_id := last_id + batch_size;
    PERFORM pg_sleep(0.1);  -- brief pause between batches to reduce I/O pressure
  END LOOP;
END $$;

-- Step 3: add NOT NULL constraint (validates only new rows since NOT VALID)
ALTER TABLE transactions ALTER COLUMN region SET NOT NULL;
-- PG 11+: this is instant if all rows are non-null; PG 10-: full scan

Renaming a column (zero downtime):

sql
-- Step 1: add new column
ALTER TABLE transactions ADD COLUMN from_address VARCHAR(66);

-- Step 2: sync old to new via trigger (or application dual-write)
CREATE OR REPLACE FUNCTION sync_from_address()
RETURNS TRIGGER AS $$
BEGIN
  NEW.from_address := NEW.sender;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_from_address
BEFORE INSERT OR UPDATE ON transactions
FOR EACH ROW EXECUTE FUNCTION sync_from_address();

-- Step 3: backfill existing rows
UPDATE transactions SET from_address = sender WHERE from_address IS NULL;

-- Step 4: deploy application reading new column (verify in staging)
-- Step 5: deploy application writing both columns
-- Step 6: deploy application writing only new column
-- Step 7: drop trigger and old column
DROP TRIGGER trg_sync_from_address ON transactions;
ALTER TABLE transactions DROP COLUMN sender;

Adding a new index:

sql
-- Always use CONCURRENTLY — takes longer but does not block
CREATE INDEX CONCURRENTLY idx_transactions_recipient
ON transactions (recipient)
WHERE recipient IS NOT NULL;  -- partial index for non-null values

-- Monitor progress
SELECT
  phase,
  blocks_done,
  blocks_total,
  tuples_done,
  tuples_total
FROM pg_stat_progress_create_index
WHERE relid = 'transactions'::regclass;

The Five Runbooks

Every production Postgres team needs these written down before they need them. The worst time to write a runbook is during an incident.

Runbook 1: Table Bloat Emergency

Symptoms: table size growing despite stable row count; autovacuum running continuously without catching up; queries slowing down on a table.

Diagnosis:

sql
-- Check dead tuple accumulation
SELECT relname, n_dead_tup, n_live_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;

-- Check for blocking long transactions
SELECT pid, usename, state,
  EXTRACT(EPOCH FROM (now() - xact_start)) AS tx_seconds,
  LEFT(query, 80)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC LIMIT 10;

-- Check for blocking replication slots
SELECT slot_name, active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

Response:

sql
-- Step 1: terminate idle-in-transaction sessions blocking vacuum
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - INTERVAL '5 minutes';

-- Step 2: drop inactive replication slots (after confirming consumer is offline)
SELECT pg_drop_replication_slot('stalled_slot');

-- Step 3: run manual vacuum with maximum memory
SET maintenance_work_mem = '1GB';
VACUUM (VERBOSE, ANALYZE) transactions;

-- Step 4: monitor progress
SELECT phase, heap_blks_scanned, heap_blks_total, num_dead_tuples
FROM pg_stat_progress_vacuum;

Runbook 2: Replication Lag Spike

Symptoms: pg_stat_replication.replay_lag climbing; standby reads returning stale data; alerting on replication bytes lag.

Diagnosis:

sql
-- On primary: check lag breakdown
SELECT application_name, state, write_lag, flush_lag, replay_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS total_lag
FROM pg_stat_replication;

-- Check if autovacuum is the WAL generator
SELECT query FROM pg_stat_activity
WHERE query LIKE 'autovacuum%' ORDER BY xact_start;

-- Check WAL generation rate
SELECT pg_size_pretty(wal_bytes) AS total_wal,
  wal_fpi, wal_records
FROM pg_stat_wal;

Response:

sql
-- If autovacuum is causing the lag burst:
-- Throttle the specific autovacuum (per-table cost delay)
ALTER TABLE transactions SET (autovacuum_vacuum_cost_delay = 20);

-- If standby is applying slowly (apply_lag >> flush_lag):
-- The standby's single-threaded apply is the bottleneck
-- Check standby hardware (CPU, I/O) — nothing to tune in Postgres directly

-- If network lag (write_lag >> 0):
-- Network between primary and standby is saturated
-- Check: ifstat, sar -n DEV on both hosts

-- Temporary: pause autovacuum on the primary during peak hours
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
-- Remember to re-enable: ALTER SYSTEM SET autovacuum = on; SELECT pg_reload_conf();

Runbook 3: XID Wraparound Warning

Symptoms: PostgreSQL log shows WARNING: database "X" must be vacuumed within N transactions; monitoring alert on age(relfrozenxid).

Diagnosis:

sql
-- Find most at-risk tables
SELECT relname, age(relfrozenxid) AS xid_age,
  pg_size_pretty(pg_total_relation_size(oid)) AS size,
  last_autovacuum, last_vacuum
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;

-- Check database-level age
SELECT datname, age(datfrozenxid) AS db_xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

Response:

sql
-- Run aggressive freeze vacuum immediately on the oldest table
VACUUM FREEZE VERBOSE transactions;

-- If autovacuum is already running a wraparound vacuum (check pg_stat_activity)
-- Do NOT interrupt it — let it complete

-- For the entire database (run during low-traffic window):
VACUUM FREEZE;

-- Monitor freeze progress
SELECT phase, heap_blks_scanned, heap_blks_total, num_index_scans
FROM pg_stat_progress_vacuum;

-- After vacuum, verify age dropped:
SELECT relname, age(relfrozenxid) FROM pg_class
WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 5;

Runbook 4: Connection Exhaustion

Symptoms: application errors FATAL: remaining connection slots are reserved for non-replication superuser connections; new connections rejected; pg_stat_activity shows connections at or near max_connections.

Diagnosis:

sql
-- Count connections by state and application
SELECT application_name, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2
ORDER BY 3 DESC;

-- Find connection hogs
SELECT usename, application_name, count(*),
  count(*) FILTER (WHERE state = 'idle') AS idle,
  count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
  count(*) FILTER (WHERE state = 'active') AS active
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY 1, 2
ORDER BY 3 DESC;

Response:

sql
-- Step 1: immediately kill idle connections from the offending application
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'offending_app'
  AND state = 'idle'
  AND state_change < now() - INTERVAL '5 minutes';

-- Step 2: kill idle-in-transaction (they are almost certainly stuck)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - INTERVAL '2 minutes';

-- Step 3: temporary emergency measure — reduce max connections per user
ALTER ROLE app_user CONNECTION LIMIT 50;
-- Revert after fixing the connection leak:
ALTER ROLE app_user CONNECTION LIMIT -1;

-- Step 4: fix root cause
-- Usually: application not closing connections (missing connection.close())
-- Or: connection pool misconfigured (pool size > Postgres max_connections / num_apps)
-- Or: connection pool not being used (each request creates a new connection)

Runbook 5: Slow Query During Incident

Symptoms: specific query suddenly slow; pg_stat_activity shows many backends waiting; p99 latency spiking.

Diagnosis:

sql
-- Find the slowest currently running queries
SELECT pid, usename,
  ROUND(EXTRACT(EPOCH FROM (now() - query_start))::numeric, 1) AS seconds,
  wait_event_type, wait_event,
  LEFT(query, 150) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start ASC
LIMIT 10;

-- Get the plan of a running query (without stopping it)
SELECT query, query_start, wait_event_type
FROM pg_stat_activity WHERE pid = <pid>;

-- Check if it is waiting for a lock
SELECT * FROM pg_blocking_pids(<pid>);

-- Get the actual plan being used (requires pg_stat_statements or auto_explain)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
WHERE query LIKE '%transactions%'
ORDER BY mean_exec_time DESC LIMIT 5;

Response:

sql
-- If a lock is the problem: find and kill the blocking query
SELECT pg_terminate_backend(blocking_pid)
FROM (SELECT unnest(pg_blocking_pids(<waiting_pid>)) AS blocking_pid) t;

-- If stale statistics is the problem (plan changed after data load):
ANALYZE transactions;

-- If work_mem spill is the problem:
-- Session-level increase for the specific connection:
ALTER ROLE app_user SET work_mem = '256MB';

-- If the query is genuinely wrong (missing index, bad plan):
-- Add the index concurrently (does not block):
CREATE INDEX CONCURRENTLY idx_transactions_new ON transactions (column);

-- Emergency: cancel a specific long-running query (gentler than terminate)
SELECT pg_cancel_backend(<pid>);

-- Emergency: kill a stuck connection entirely
SELECT pg_terminate_backend(<pid>);

`auto_explain`: Automatic Plan Logging

For queries that are intermittently slow, auto_explain logs the execution plan automatically when execution time exceeds a threshold.

ini
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = 1000  # log plans for queries > 1 second
auto_explain.log_analyze = on         # include ANALYZE timing
auto_explain.log_buffers = on         # include buffer stats
auto_explain.log_nested_statements = on  # include plans from functions
auto_explain.sample_rate = 0.1        # log 10% of qualifying queries (reduce noise)

This is invaluable for diagnosing intermittent slowdowns — you get the actual plan at the time of slowness without having to reproduce it.

The Production Incident: Connection Exhaustion Cascade

Context: A blockchain indexer with 500 max_connections behind PgBouncer in transaction mode.

What happened:

A deployment introduced a bug: under certain error conditions, the application's database connection was not returned to the pool — it was abandoned while holding an open transaction. The transaction stayed idle in transaction.

At normal load: the leak rate was slow enough that connections accumulated over hours without triggering alerts.

At 2:00 AM, a spike in blockchain traffic increased error rate. The leak accelerated. Within 20 minutes:

PgBouncer's cl_waiting counter climbed (application connections queuing for a server connection)
Application timeouts increased (requests waiting for a connection from the pool)
Application retried timed-out requests (each retry needed its own connection)
The retry storm caused even more connections to be consumed
PgBouncer ran out of server connections; applications received connection refused
Applications retried again — the storm intensified

Peak: 12,000 application connections to PgBouncer, 500 Postgres connections all idle in transaction, zero connections available for legitimate queries.

Resolution sequence:

sql
-- Step 1: kill all idle-in-transaction connections (immediate relief)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction';
-- 487 connections killed

-- Step 2: identify the application sending the most connections
SELECT application_name, count(*)
FROM pg_stat_activity
GROUP BY 1 ORDER BY 2 DESC;
-- 'indexer-worker' had 312 connections

-- Step 3: limit connections from offending app while fix is deployed
ALTER ROLE indexer_worker CONNECTION LIMIT 50;

-- Step 4: add the missing timeout to prevent recurrence
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';
SELECT pg_reload_conf();

The root fix: idle_in_transaction_session_timeout = '30s' was added to postgresql.conf. Any connection that stays idle-in-transaction for more than 30 seconds is now automatically terminated by Postgres. The connection leak bug was fixed in the application, but the timeout ensures any future bug of the same class is self-healing.

Post-incident addition: a PgBouncer alert on cl_waiting > 100 was added — the waiting queue was the earliest visible signal, 15 minutes before the outage became critical.

Summary

Area	Key Takeaway
Monitoring	`pg_stat_activity`, `pg_stat_statements`, `pg_stat_user_tables`, `pg_stat_replication` — know these cold
`idle_in_transaction_session_timeout`	Set it. 30–60 seconds. Prevents an entire class of bloat and connection leak incidents
`statement_timeout`	Set it (300s or less). Prevents runaway queries from holding locks or consuming resources indefinitely
PgBouncer mode	Transaction mode for OLTP; session mode only if you need prepared statements
Pool sizing	`(num_cores × 2) + 1` is a reasonable starting point for Postgres connections
Backup	pg_dump for portability; pg_basebackup + WAL archive for PITR; pgBackRest for production
Backup testing	Restore drills quarterly — an untested backup is not a backup
Schema migrations	`CREATE INDEX CONCURRENTLY`, batched backfills, `NOT VALID` + `VALIDATE` pattern
Runbooks	Write them before you need them. The worst time to learn these queries is during an incident

Course Complete

You have now covered the full operational stack of PostgreSQL — from the 8KB page on disk to the replication slot across availability zones. The mental model this course set out to build in Module 0 is now complete.

The path from here:

Practice: run the pageinspect, pgstattuple, and pg_stat_* queries against a real database. The numbers will be different from the examples — that is the point. Learn to read your specific system.

Operationalize: implement the alerting thresholds, write the runbooks, add idle_in_transaction_session_timeout to every production instance you run.

Go deeper: the Postgres source code is readable C. The storage manager, the executor, the planner — all of it is there. backend/storage/buffer/, backend/executor/, backend/optimizer/ are where the concepts in this course live in code.

The database does not lie. Every behavior has a cause in the mechanics described here. When something unexpected happens in production, the answer is in pg_stat_*, in WAL, in the heap pages — you now have the model to find it.

PreviousModule 11: SQL vs NoSQL: An Engineer's Framework, Not a Marketing Debate Next Module 13: Locking Internals: Row Locks, Table Locks, and Advisory Locks

Module 12 — Production Operations: Monitoring, Migration, and the Runbooks That Matter

The Monitoring Stack

The Essential Views

pg_stat_statements: Query Performance Over Time

Alerting Thresholds

Connection Pooling: PgBouncer

PgBouncer Pooling Modes

Sizing the Pool

Backup Strategies

pg_dump: Logical Backup

pg_basebackup: Physical Backup

pgBackRest: Production Backup Tool

Backup Testing

Zero-Downtime Migration Checklist

Operations That Lock (Avoid Without a Plan)

Safe Operations (No Significant Lock)

The Migration Sequence

The Five Runbooks

Runbook 1: Table Bloat Emergency

Runbook 2: Replication Lag Spike

Runbook 3: XID Wraparound Warning

Runbook 4: Connection Exhaustion

Runbook 5: Slow Query During Incident

auto_explain: Automatic Plan Logging

The Production Incident: Connection Exhaustion Cascade

Summary

Course Complete

Discussion

`pg_stat_statements`: Query Performance Over Time

`pg_dump`: Logical Backup

`pg_basebackup`: Physical Backup

`auto_explain`: Automatic Plan Logging