The difference between junior and senior is not SQL knowledge — it is having the right runbooks before the incident.
Module 12 — Production Operations: Monitoring, Migration, and the Runbooks That Matter
What this module covers: The difference between a junior and a senior database engineer is not SQL knowledge — it is having the right runbooks before the incident, the right monitoring before the alert, and the right migration plan before the deployment. This module covers the operational stack: what to measure and how, connection pooling with PgBouncer, backup and restore, zero-downtime migration patterns, and the five runbooks that every team running Postgres in production needs to have written down before they need them.
The Monitoring Stack
You cannot operate what you cannot observe. Postgres exposes an exceptional amount of internal state through pg_stat_* views. The challenge is not access — it is knowing which metrics to collect, what thresholds to alert on, and what to do when they fire.
The Essential Views
sql-- 1. Active connections and what they are doing SELECT pid, usename, application_name, client_addr, state, -- 'active', 'idle', 'idle in transaction', 'disabled' wait_event_type, -- 'Lock', 'LWLock', 'IO', 'Client', etc. wait_event, ROUND(EXTRACT(EPOCH FROM (now() - state_change))::numeric, 1) AS state_seconds, LEFT(query, 100) AS query_snippet FROM pg_stat_activity WHERE pid != pg_backend_pid() ORDER BY state_seconds DESC;
idle in transaction connections are the most dangerous. They hold open MVCC snapshots, pin OldestXmin, and block autovacuum from reclaiming dead tuples (Module 2 and 4). Any connection idle in transaction for more than 30 seconds deserves investigation.
sql-- 2. Table-level access statistics SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del, n_live_tup, n_dead_tup, ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;
sql-- 3. Index usage SELECT relname AS table, indexrelname AS index, idx_scan, idx_tup_read, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size FROM pg_stat_user_indexes ORDER BY pg_relation_size(indexrelid) DESC LIMIT 20; -- idx_scan = 0 on large indexes = candidate for removal
sql-- 4. Cache hit ratio (should be > 99% for OLTP) SELECT sum(heap_blks_hit) AS heap_hits, sum(heap_blks_read) AS heap_reads, ROUND( 100.0 * sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2 ) AS cache_hit_pct FROM pg_statio_user_tables; -- Per-table cache hit SELECT relname, heap_blks_hit, heap_blks_read, ROUND(100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct FROM pg_statio_user_tables WHERE heap_blks_hit + heap_blks_read > 1000 ORDER BY heap_blks_read DESC;
sql-- 5. Lock waits (queries waiting for locks) SELECT blocked.pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query, blocked.wait_event, ROUND(EXTRACT(EPOCH FROM (now() - blocked.state_change))::numeric, 1) AS waiting_seconds FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid)) WHERE blocked.wait_event_type = 'Lock';
sql-- 6. Long-running queries SELECT pid, usename, ROUND(EXTRACT(EPOCH FROM (now() - query_start))::numeric, 1) AS query_seconds, state, LEFT(query, 120) AS query FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - INTERVAL '30 seconds' ORDER BY query_seconds DESC;
pg_stat_statements: Query Performance Over Time
sql-- Top 15 queries by total time (install extension first) CREATE EXTENSION IF NOT EXISTS pg_stat_statements; SELECT LEFT(query, 100) AS query, calls, ROUND(total_exec_time::numeric / 1000, 2) AS total_sec, ROUND(mean_exec_time::numeric, 2) AS mean_ms, ROUND(stddev_exec_time::numeric, 2) AS stddev_ms, ROUND(100.0 * total_exec_time / sum(total_exec_time) OVER (), 2) AS pct_total, rows / NULLIF(calls, 0) AS avg_rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 15;
sql-- Queries with highest I/O cost (cache miss heavy) SELECT LEFT(query, 100) AS query, calls, shared_blks_read AS disk_reads, shared_blks_hit AS cache_hits, ROUND(100.0 * shared_blks_read / NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS miss_pct, ROUND(mean_exec_time::numeric, 2) AS mean_ms FROM pg_stat_statements WHERE shared_blks_hit + shared_blks_read > 10000 ORDER BY shared_blks_read DESC LIMIT 10;
Alerting Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
Connections used (% of max_connections) | > 70% | > 85% | Check for connection leaks; increase PgBouncer pool |
idle in transaction connections | > 5 | > 20 | Kill offending connections; fix application code |
| Replication lag (bytes) | > 100MB | > 500MB | Investigate standby I/O; reduce write load |
| Dead tuple % on any table | > 20% | > 40% | Run manual VACUUM; tune autovacuum for that table |
| XID age on any table | > 1B | > 1.5B | Run VACUUM FREEZE immediately |
| Cache hit ratio | < 99% | < 95% | Increase shared_buffers; optimize query working set |
pg_wal directory size | > 60% of partition | > 80% | Check inactive replication slots; increase disk |
| Long-running queries | > 30s | > 5min | Identify and kill; add timeout: statement_timeout |
| Lock wait time | > 5s | > 30s | Find blocking query; investigate lock contention |
ini# Automatic statement timeout — kills runaway queries statement_timeout = '300s' # 5 minutes max for any query lock_timeout = '30s' # fail fast if can't acquire lock idle_in_transaction_session_timeout = '60s' # kill idle-in-transaction after 60s
idle_in_transaction_session_timeout is one of the most important production settings that most teams don't have set. It automatically terminates connections that have been idle-in-transaction for too long — eliminating the entire class of "someone left a transaction open in their REPL and the table hasn't vacuumed in 6 hours" incidents.
Connection Pooling: PgBouncer
Every Postgres connection is a full OS process (Module 0). Above a few hundred connections, fork overhead and memory pressure become significant. PgBouncer is the standard connection pooler — a lightweight proxy that multiplexes many application connections onto a smaller number of Postgres connections.
PgBouncer Pooling Modes
Session mode:
- One Postgres connection is assigned to an application connection for the entire session
- Prepared statements, advisory locks, and
SETparameters work correctly - Pool size = max simultaneous sessions ≈ similar to no pooler for connection count
- Use when: you need prepared statements or session-level state
Transaction mode:
- A Postgres connection is held only during a transaction; returned to the pool after
COMMIT/ROLLBACK - One Postgres connection can serve many sequential application transactions
- Breaks: prepared statements (not persisted across transactions), advisory locks,
SETparameters,LISTEN/NOTIFY - Use when: high-throughput OLTP with short transactions and no prepared statements
- Most PgBouncer deployments use this mode
Statement mode:
- A Postgres connection is held only for a single statement
- Breaks: multi-statement transactions (each statement gets its own connection → no transaction atomicity)
- Rarely used in practice
ini# pgbouncer.ini [databases] mydb = host=primary port=5432 dbname=indexer [pgbouncer] pool_mode = transaction max_client_conn = 5000 # max application connections to PgBouncer default_pool_size = 50 # Postgres connections per database per user min_pool_size = 10 reserve_pool_size = 5 # extra connections for burst reserve_pool_timeout = 3 # seconds before using reserve pool server_idle_timeout = 600 # close idle Postgres connections after 10 min server_lifetime = 3600 # recycle Postgres connections after 1 hour client_idle_timeout = 0 # don't close idle application connections (app handles this) # Authentication auth_type = scram-sha-256 auth_file = /etc/pgbouncer/userlist.txt
Sizing the Pool
The optimal number of Postgres connections is not "as many as possible" — it is the number that saturates your hardware without thrashing.
A well-known formula (from PgBouncer's own documentation):
optimal_pool_size ≈ (num_cores × 2) + num_effective_spindles
For a 16-core server with NVMe (no spindles):
optimal ≈ 16 × 2 + 1 = 33 connections
More connections than this means CPU context switching and memory pressure outweigh the benefit of additional concurrency. PgBouncer's default_pool_size = 30–50 for a modern server is a reasonable starting point.
sql-- Monitor PgBouncer pool utilization (from PgBouncer admin console) SHOW POOLS; -- cl_active: clients currently using a server connection -- cl_waiting: clients waiting for a connection (this should be near 0) -- sv_active: server connections currently in use -- sv_idle: server connections available in the pool
cl_waiting > 0 consistently means your pool is undersized for the workload. Either increase default_pool_size (requires more Postgres connections) or reduce application connection count.
Backup Strategies
pg_dump: Logical Backup
bash# Full database dump (SQL format) pg_dump -h localhost -U postgres -d indexer \ --format=custom \ # custom binary format, fastest restore --compress=9 \ # maximum compression --file=indexer_backup_$(date +%Y%m%d).dump # Restore pg_restore -h localhost -U postgres -d indexer_restored \ --jobs=8 \ # parallel restore with 8 workers indexer_backup_20260517.dump # Table-selective dump pg_dump -h localhost -U postgres -d indexer \ --table=transactions \ --format=custom \ --file=transactions_backup.dump
Advantages: portable across Postgres versions, table-selective, works through logical replication boundaries.
Disadvantages: slow for large databases (must serialize every row), creates significant I/O load on the source, cannot be used for PITR.
pg_basebackup: Physical Backup
bash# Full cluster backup (binary copy of $PGDATA) pg_basebackup \ -h localhost \ -U replicator \ -D /backup/base_$(date +%Y%m%d) \ --format=tar \ --compress=9 \ --checkpoint=fast \ # force immediate checkpoint before backup starts --wal-method=stream \ # stream WAL during backup (ensures backup is consistent) --progress \ --verbose # Estimated size before running: pg_basebackup --estimate-size -h localhost -U replicator -D /dev/null
Advantages: fast (parallel I/O), complete cluster copy, can be used as base for PITR, can be used directly as a standby.
Disadvantages: not portable across major versions, full cluster only (not table-selective), requires a replication connection.
pgBackRest: Production Backup Tool
For production systems, pgBackRest is the standard. It handles:
- Incremental and differential backups (only changed blocks since last backup)
- Parallel backup and restore
- WAL archiving integrated with backup
- Backup verification
- Remote backup to S3, Azure, GCS
ini# /etc/pgbackrest/pgbackrest.conf [global] repo1-path=/backup/pgbackrest repo1-retention-full=2 # keep last 2 full backups repo1-retention-diff=7 # keep 7 differential backups compress-type=lz4 compress-level=6 [global:archive] archive-async=y archive-get-queue-max=128mb [indexer] pg1-path=/var/lib/postgresql/data pg1-host=primary pg1-user=postgres
bash# Full backup pgbackrest --stanza=indexer backup --type=full # Incremental backup (default) pgbackrest --stanza=indexer backup # Restore to latest pgbackrest --stanza=indexer restore # PITR to specific time pgbackrest --stanza=indexer restore \ --target="2026-05-17 14:30:00" \ --target-action=promote
Backup Testing
A backup you have never tested is not a backup — it is a false sense of security. Schedule quarterly restore drills:
bash# Monthly: restore to a test instance and verify pgbackrest --stanza=indexer restore --target-action=promote \ --pg1-path=/tmp/restore_test # Verify row counts match production psql -d restored_indexer -c "SELECT count(*) FROM transactions;" # Compare to: SELECT count(*) FROM transactions; on production # Verify PITR capability pgbackrest --stanza=indexer restore \ --target="2026-05-15 00:00:00" \ --target-action=pause \ --pg1-path=/tmp/pitr_test
Zero-Downtime Migration Checklist
Every schema change on a live production database must be evaluated against this checklist.
Operations That Lock (Avoid Without a Plan)
sql-- These acquire AccessExclusiveLock — blocks ALL reads and writes: ALTER TABLE transactions ADD COLUMN x TEXT NOT NULL; -- PG < 11 only ALTER TABLE transactions ALTER COLUMN amount TYPE NUMERIC(40, 8); ALTER TABLE transactions DROP COLUMN payload; -- instant, but lock acquired CREATE INDEX ON transactions (block_height); -- use CONCURRENTLY instead VACUUM FULL transactions; CLUSTER transactions USING idx_transactions_height;
Safe Operations (No Significant Lock)
sql-- These are safe on live tables: ALTER TABLE transactions ADD COLUMN notes TEXT; -- nullable, instant ALTER TABLE transactions ADD COLUMN version INT NOT NULL DEFAULT 1; -- PG 11+, instant ALTER TABLE transactions ALTER COLUMN notes SET DEFAULT 'none'; -- instant ALTER TABLE transactions ADD CONSTRAINT amount_positive CHECK (amount >= 0) NOT VALID; -- instant (skip validation) ALTER TABLE transactions VALIDATE CONSTRAINT amount_positive; -- ShareUpdateExclusiveLock CREATE INDEX CONCURRENTLY idx_transactions_sender_new ON transactions (sender); -- no blocking DROP INDEX CONCURRENTLY idx_transactions_sender_old; -- no blocking
The Migration Sequence
Adding a NOT NULL column with a default (PG 10 and earlier):
sql-- Step 1: add nullable (instant) ALTER TABLE transactions ADD COLUMN region TEXT; -- Step 2: backfill in batches (avoid one giant UPDATE that locks) DO $$ DECLARE batch_size INT := 10000; last_id BIGINT := 0; max_id BIGINT; BEGIN SELECT max(id) INTO max_id FROM transactions; WHILE last_id < max_id LOOP UPDATE transactions SET region = 'us-east' WHERE id > last_id AND id <= last_id + batch_size AND region IS NULL; last_id := last_id + batch_size; PERFORM pg_sleep(0.1); -- brief pause between batches to reduce I/O pressure END LOOP; END $$; -- Step 3: add NOT NULL constraint (validates only new rows since NOT VALID) ALTER TABLE transactions ALTER COLUMN region SET NOT NULL; -- PG 11+: this is instant if all rows are non-null; PG 10-: full scan
Renaming a column (zero downtime):
sql-- Step 1: add new column ALTER TABLE transactions ADD COLUMN from_address VARCHAR(66); -- Step 2: sync old to new via trigger (or application dual-write) CREATE OR REPLACE FUNCTION sync_from_address() RETURNS TRIGGER AS $$ BEGIN NEW.from_address := NEW.sender; RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER trg_sync_from_address BEFORE INSERT OR UPDATE ON transactions FOR EACH ROW EXECUTE FUNCTION sync_from_address(); -- Step 3: backfill existing rows UPDATE transactions SET from_address = sender WHERE from_address IS NULL; -- Step 4: deploy application reading new column (verify in staging) -- Step 5: deploy application writing both columns -- Step 6: deploy application writing only new column -- Step 7: drop trigger and old column DROP TRIGGER trg_sync_from_address ON transactions; ALTER TABLE transactions DROP COLUMN sender;
Adding a new index:
sql-- Always use CONCURRENTLY — takes longer but does not block CREATE INDEX CONCURRENTLY idx_transactions_recipient ON transactions (recipient) WHERE recipient IS NOT NULL; -- partial index for non-null values -- Monitor progress SELECT phase, blocks_done, blocks_total, tuples_done, tuples_total FROM pg_stat_progress_create_index WHERE relid = 'transactions'::regclass;
The Five Runbooks
Every production Postgres team needs these written down before they need them. The worst time to write a runbook is during an incident.
Runbook 1: Table Bloat Emergency
Symptoms: table size growing despite stable row count; autovacuum running continuously without catching up; queries slowing down on a table.
Diagnosis:
sql-- Check dead tuple accumulation SELECT relname, n_dead_tup, n_live_tup, ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10; -- Check for blocking long transactions SELECT pid, usename, state, EXTRACT(EPOCH FROM (now() - xact_start)) AS tx_seconds, LEFT(query, 80) FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start ASC LIMIT 10; -- Check for blocking replication slots SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;
Response:
sql-- Step 1: terminate idle-in-transaction sessions blocking vacuum SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND xact_start < now() - INTERVAL '5 minutes'; -- Step 2: drop inactive replication slots (after confirming consumer is offline) SELECT pg_drop_replication_slot('stalled_slot'); -- Step 3: run manual vacuum with maximum memory SET maintenance_work_mem = '1GB'; VACUUM (VERBOSE, ANALYZE) transactions; -- Step 4: monitor progress SELECT phase, heap_blks_scanned, heap_blks_total, num_dead_tuples FROM pg_stat_progress_vacuum;
Runbook 2: Replication Lag Spike
Symptoms: pg_stat_replication.replay_lag climbing; standby reads returning stale data; alerting on replication bytes lag.
Diagnosis:
sql-- On primary: check lag breakdown SELECT application_name, state, write_lag, flush_lag, replay_lag, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS total_lag FROM pg_stat_replication; -- Check if autovacuum is the WAL generator SELECT query FROM pg_stat_activity WHERE query LIKE 'autovacuum%' ORDER BY xact_start; -- Check WAL generation rate SELECT pg_size_pretty(wal_bytes) AS total_wal, wal_fpi, wal_records FROM pg_stat_wal;
Response:
sql-- If autovacuum is causing the lag burst: -- Throttle the specific autovacuum (per-table cost delay) ALTER TABLE transactions SET (autovacuum_vacuum_cost_delay = 20); -- If standby is applying slowly (apply_lag >> flush_lag): -- The standby's single-threaded apply is the bottleneck -- Check standby hardware (CPU, I/O) — nothing to tune in Postgres directly -- If network lag (write_lag >> 0): -- Network between primary and standby is saturated -- Check: ifstat, sar -n DEV on both hosts -- Temporary: pause autovacuum on the primary during peak hours ALTER SYSTEM SET autovacuum = off; SELECT pg_reload_conf(); -- Remember to re-enable: ALTER SYSTEM SET autovacuum = on; SELECT pg_reload_conf();
Runbook 3: XID Wraparound Warning
Symptoms: PostgreSQL log shows WARNING: database "X" must be vacuumed within N transactions; monitoring alert on age(relfrozenxid).
Diagnosis:
sql-- Find most at-risk tables SELECT relname, age(relfrozenxid) AS xid_age, pg_size_pretty(pg_total_relation_size(oid)) AS size, last_autovacuum, last_vacuum FROM pg_class WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 10; -- Check database-level age SELECT datname, age(datfrozenxid) AS db_xid_age FROM pg_database ORDER BY age(datfrozenxid) DESC;
Response:
sql-- Run aggressive freeze vacuum immediately on the oldest table VACUUM FREEZE VERBOSE transactions; -- If autovacuum is already running a wraparound vacuum (check pg_stat_activity) -- Do NOT interrupt it — let it complete -- For the entire database (run during low-traffic window): VACUUM FREEZE; -- Monitor freeze progress SELECT phase, heap_blks_scanned, heap_blks_total, num_index_scans FROM pg_stat_progress_vacuum; -- After vacuum, verify age dropped: SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 5;
Runbook 4: Connection Exhaustion
Symptoms: application errors FATAL: remaining connection slots are reserved for non-replication superuser connections; new connections rejected; pg_stat_activity shows connections at or near max_connections.
Diagnosis:
sql-- Count connections by state and application SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC; -- Find connection hogs SELECT usename, application_name, count(*), count(*) FILTER (WHERE state = 'idle') AS idle, count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx, count(*) FILTER (WHERE state = 'active') AS active FROM pg_stat_activity WHERE pid != pg_backend_pid() GROUP BY 1, 2 ORDER BY 3 DESC;
Response:
sql-- Step 1: immediately kill idle connections from the offending application SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'offending_app' AND state = 'idle' AND state_change < now() - INTERVAL '5 minutes'; -- Step 2: kill idle-in-transaction (they are almost certainly stuck) SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND state_change < now() - INTERVAL '2 minutes'; -- Step 3: temporary emergency measure — reduce max connections per user ALTER ROLE app_user CONNECTION LIMIT 50; -- Revert after fixing the connection leak: ALTER ROLE app_user CONNECTION LIMIT -1; -- Step 4: fix root cause -- Usually: application not closing connections (missing connection.close()) -- Or: connection pool misconfigured (pool size > Postgres max_connections / num_apps) -- Or: connection pool not being used (each request creates a new connection)
Runbook 5: Slow Query During Incident
Symptoms: specific query suddenly slow; pg_stat_activity shows many backends waiting; p99 latency spiking.
Diagnosis:
sql-- Find the slowest currently running queries SELECT pid, usename, ROUND(EXTRACT(EPOCH FROM (now() - query_start))::numeric, 1) AS seconds, wait_event_type, wait_event, LEFT(query, 150) AS query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start ASC LIMIT 10; -- Get the plan of a running query (without stopping it) SELECT query, query_start, wait_event_type FROM pg_stat_activity WHERE pid = <pid>; -- Check if it is waiting for a lock SELECT * FROM pg_blocking_pids(<pid>); -- Get the actual plan being used (requires pg_stat_statements or auto_explain) SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements WHERE query LIKE '%transactions%' ORDER BY mean_exec_time DESC LIMIT 5;
Response:
sql-- If a lock is the problem: find and kill the blocking query SELECT pg_terminate_backend(blocking_pid) FROM (SELECT unnest(pg_blocking_pids(<waiting_pid>)) AS blocking_pid) t; -- If stale statistics is the problem (plan changed after data load): ANALYZE transactions; -- If work_mem spill is the problem: -- Session-level increase for the specific connection: ALTER ROLE app_user SET work_mem = '256MB'; -- If the query is genuinely wrong (missing index, bad plan): -- Add the index concurrently (does not block): CREATE INDEX CONCURRENTLY idx_transactions_new ON transactions (column); -- Emergency: cancel a specific long-running query (gentler than terminate) SELECT pg_cancel_backend(<pid>); -- Emergency: kill a stuck connection entirely SELECT pg_terminate_backend(<pid>);
auto_explain: Automatic Plan Logging
For queries that are intermittently slow, auto_explain logs the execution plan automatically when execution time exceeds a threshold.
ini# postgresql.conf shared_preload_libraries = 'pg_stat_statements,auto_explain' auto_explain.log_min_duration = 1000 # log plans for queries > 1 second auto_explain.log_analyze = on # include ANALYZE timing auto_explain.log_buffers = on # include buffer stats auto_explain.log_nested_statements = on # include plans from functions auto_explain.sample_rate = 0.1 # log 10% of qualifying queries (reduce noise)
This is invaluable for diagnosing intermittent slowdowns — you get the actual plan at the time of slowness without having to reproduce it.
The Production Incident: Connection Exhaustion Cascade
Context: A blockchain indexer with 500 max_connections behind PgBouncer in transaction mode.
What happened:
A deployment introduced a bug: under certain error conditions, the application's database connection was not returned to the pool — it was abandoned while holding an open transaction. The transaction stayed idle in transaction.
At normal load: the leak rate was slow enough that connections accumulated over hours without triggering alerts.
At 2:00 AM, a spike in blockchain traffic increased error rate. The leak accelerated. Within 20 minutes:
- PgBouncer's
cl_waitingcounter climbed (application connections queuing for a server connection) - Application timeouts increased (requests waiting for a connection from the pool)
- Application retried timed-out requests (each retry needed its own connection)
- The retry storm caused even more connections to be consumed
- PgBouncer ran out of server connections; applications received
connection refused - Applications retried again — the storm intensified
Peak: 12,000 application connections to PgBouncer, 500 Postgres connections all idle in transaction, zero connections available for legitimate queries.
Resolution sequence:
sql-- Step 1: kill all idle-in-transaction connections (immediate relief) SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction'; -- 487 connections killed -- Step 2: identify the application sending the most connections SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC; -- 'indexer-worker' had 312 connections -- Step 3: limit connections from offending app while fix is deployed ALTER ROLE indexer_worker CONNECTION LIMIT 50; -- Step 4: add the missing timeout to prevent recurrence ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s'; SELECT pg_reload_conf();
The root fix: idle_in_transaction_session_timeout = '30s' was added to postgresql.conf. Any connection that stays idle-in-transaction for more than 30 seconds is now automatically terminated by Postgres. The connection leak bug was fixed in the application, but the timeout ensures any future bug of the same class is self-healing.
Post-incident addition: a PgBouncer alert on cl_waiting > 100 was added — the waiting queue was the earliest visible signal, 15 minutes before the outage became critical.
Summary
| Area | Key Takeaway |
|---|---|
| Monitoring | pg_stat_activity, pg_stat_statements, pg_stat_user_tables, pg_stat_replication — know these cold |
idle_in_transaction_session_timeout | Set it. 30–60 seconds. Prevents an entire class of bloat and connection leak incidents |
statement_timeout | Set it (300s or less). Prevents runaway queries from holding locks or consuming resources indefinitely |
| PgBouncer mode | Transaction mode for OLTP; session mode only if you need prepared statements |
| Pool sizing | (num_cores × 2) + 1 is a reasonable starting point for Postgres connections |
| Backup | pg_dump for portability; pg_basebackup + WAL archive for PITR; pgBackRest for production |
| Backup testing | Restore drills quarterly — an untested backup is not a backup |
| Schema migrations | CREATE INDEX CONCURRENTLY, batched backfills, NOT VALID + VALIDATE pattern |
| Runbooks | Write them before you need them. The worst time to learn these queries is during an incident |
Course Complete
You have now covered the full operational stack of PostgreSQL — from the 8KB page on disk to the replication slot across availability zones. The mental model this course set out to build in Module 0 is now complete.
The path from here:
Practice: run the pageinspect, pgstattuple, and pg_stat_* queries against a real database. The numbers will be different from the examples — that is the point. Learn to read your specific system.
Operationalize: implement the alerting thresholds, write the runbooks, add idle_in_transaction_session_timeout to every production instance you run.
Go deeper: the Postgres source code is readable C. The storage manager, the executor, the planner — all of it is there. backend/storage/buffer/, backend/executor/, backend/optimizer/ are where the concepts in this course live in code.
The database does not lie. Every behavior has a cause in the mechanics described here. When something unexpected happens in production, the answer is in pg_stat_*, in WAL, in the heap pages — you now have the model to find it.