Module 6·39 min read

The planner is a cost-based optimizer. Every wrong plan has a root cause in statistics, configuration, or schema design.

Module 6 — Query Planning and Execution: How Postgres Decides What to Do With Your Query

What this module covers: The query planner is a cost-based optimizer. It does not know the right answer — it estimates the cost of many possible plans and picks the cheapest one based on statistics. Every wrong plan has a root cause: stale statistics, a bad cardinality estimate, a misconfigured cost parameter, or a schema design that gives the planner no good options. This module gives you the mental model to read any EXPLAIN output, identify what went wrong, and fix it.

The Planner's Job

When you execute a query, the planner receives a query tree (the semantically validated parse output from Module 0) and must produce an execution plan — a tree of physical operations that, when executed, returns the correct result.

For a non-trivial query, hundreds or thousands of valid execution plans exist. Different join orders, different index choices, different aggregation strategies — all produce the same correct result with wildly different costs.

The planner's job is to find the plan with the lowest estimated cost, fast enough that planning time does not become a bottleneck itself.

It does this by:

Generating candidate plans using dynamic programming
Estimating the cost of each plan using a statistical model
Emitting the lowest-cost plan to the executor

Every word in step 2 is load-bearing: estimating (not measuring), based on a statistical model (not reality). Plan quality is bounded by statistics quality. This is the single most important thing to understand about the Postgres planner.

The Cost Model

Cost Units

Postgres measures plan cost in arbitrary cost units that approximate I/O and CPU work. The absolute numbers are meaningless — only relative comparisons between plans matter.

The cost parameters that define what one unit means:

ini
seq_page_cost    = 1.0   # cost of reading one page sequentially
random_page_cost = 4.0   # cost of reading one page randomly (I/O seek)
cpu_tuple_cost   = 0.01  # cost of processing one row
cpu_index_tuple_cost = 0.005  # cost of processing one index entry
cpu_operator_cost = 0.0025    # cost of evaluating one operator

random_page_cost = 4.0 is the most important tuning parameter for most systems.

The default of 4.0 was calibrated for spinning disks where random reads cost ~4x more than sequential reads. On SSDs, random reads cost 1.1–2x sequential reads. On NVMe with a warm OS cache, they're nearly equal.

If your storage is SSD and random_page_cost is still at 4.0, the planner overestimates the cost of index scans and chooses sequential scans when it shouldn't.

ini
# For SSD storage
random_page_cost = 1.5

# For NVMe with large shared_buffers (mostly cache hits)
random_page_cost = 1.1

This single change often fixes "the planner won't use my index" problems on modern hardware.

Startup vs Total Cost

Every plan node reports two costs: (startup_cost..total_cost).

Startup cost: cost before the first row is returned
Total cost: cost to return all rows

Sort  (cost=12453.84..12703.84 rows=100000 width=76)
  ->  Seq Scan on transactions  (cost=0.00..5330.00 rows=100000 width=76)

The Sort node has high startup cost because it must materialize all input rows before returning any output. The Seq Scan has zero startup cost — it starts returning rows immediately.

For LIMIT queries, the planner prefers plans with low startup cost because only the first N rows need to be produced:

sql
SELECT * FROM transactions ORDER BY timestamp DESC LIMIT 10;

The planner may choose an index scan on timestamp (which delivers rows in sorted order immediately, low startup cost) over a sequential scan + sort (which has high startup cost from sorting, even if lower total cost).

This is the LIMIT optimization: low startup cost beats low total cost when not all rows are needed.

Statistics: The Foundation of Every Estimate

What Postgres Collects

ANALYZE (or autovacuum's analyze phase) collects statistics about each column and stores them in pg_statistic. The human-readable view is pg_stats.

sql
SELECT
  attname,
  n_distinct,
  correlation,
  null_frac,
  avg_width,
  most_common_vals,
  most_common_freqs,
  histogram_bounds
FROM pg_stats
WHERE tablename = 'transactions'
  AND attname = 'status';

For a status column with values ['confirmed', 'pending', 'failed']:

n_distinct:        3
null_frac:         0
most_common_vals:  {confirmed, pending, failed}
most_common_freqs: {0.952, 0.039, 0.009}

The planner knows: status = 'confirmed' matches 95.2% of rows, status = 'pending' matches 3.9%. This is how it decides whether an index on status is worth using for a given query.

For numeric columns like block_height:

n_distinct:       -0.95         ← negative means ~95% of rows have distinct values
correlation:       0.834
histogram_bounds: {18000000, 18050000, 18100000, ...}

The histogram divides the value range into equal-frequency buckets. Each bucket boundary is a value where roughly 1/statistics_target of rows fall below it. The planner uses linear interpolation within buckets.

`statistics_target`: The Detail Level

default_statistics_target (default: 100) controls how many histogram buckets and most-common-values are collected. More buckets = better estimates for selective predicates = better plans = slower ANALYZE.

sql
-- Check current statistics target for a column
SELECT attstattarget FROM pg_attribute
WHERE attrelid = 'transactions'::regclass
  AND attname = 'block_height';
-- -1 means use default_statistics_target

-- Increase for a column with complex distribution
ALTER TABLE transactions
ALTER COLUMN block_height SET STATISTICS 500;

ANALYZE transactions;

Increase statistics_target for columns where:

The planner makes bad cardinality estimates (check estimated vs actual rows in EXPLAIN ANALYZE)
The column has many distinct values with uneven distribution
The column is used in range predicates with wide selectivity variance

The cost: ANALYZE takes longer and pg_statistic uses more space. The benefit: better plans.

Extended Statistics: Multi-Column Correlations

The planner assumes columns are statistically independent. When they are not, estimates for multi-column predicates are wrong.

sql
-- city and zip_code are not independent — every city has specific zip codes
SELECT count(*) FROM addresses WHERE city = 'San Francisco' AND zip_code = '94105';
-- Planner estimate: P(city='SF') * P(zip='94105') * row_count (assumes independence)
-- Actual: all '94105' rows are in 'SF' — the estimate is orders of magnitude off

Extended statistics capture multi-column dependencies:

sql
-- Create statistics that capture city/zip correlation
CREATE STATISTICS addr_city_zip (dependencies, ndistinct, mcv)
ON city, zip_code FROM addresses;

ANALYZE addresses;

-- Now the planner uses the joint distribution instead of assuming independence

Three kinds of extended statistics:

dependencies — captures functional dependencies (zip implies city)
ndistinct — captures combined distinct count for GROUP BY estimates
mcv — most-common-value combinations for complex predicate estimates

Extended statistics is one of the most underused planner improvements available.

Scan Nodes: How Rows Are Retrieved

Sequential Scan

Reads every page of the heap in order. Always correct. Often the right choice for large result sets or low-selectivity predicates.

Seq Scan on transactions  (cost=0.00..285432.00 rows=10000000 width=148)
  Filter: (block_height > 18000000)
  Rows Removed by Filter: 523847

Rows Removed by Filter tells you how much work the filter is doing post-scan. High numbers with a filter that could be served by an index = potential missing index.

Index Scan

Traverses the B-tree to find matching TIDs, then fetches heap pages in TID order (random access).

Index Scan using idx_transactions_block_height on transactions
  (cost=0.57..8.59 rows=1 width=148) (actual time=0.032..0.034 rows=1 loops=1)
  Index Cond: (block_height = 18500050)
  Buffers: shared hit=4

Buffers: shared hit=4 — 4 page reads, all from shared_buffers cache. 3 B-tree pages + 1 heap page. This is the optimal case.

Bitmap Index Scan + Bitmap Heap Scan

For queries matching many rows, a bitmap scan is more efficient than a pure index scan:

Bitmap Heap Scan on transactions  (cost=1823.44..18234.55 rows=90000 width=148)
  Recheck Cond: (block_height BETWEEN 18500000 AND 18510000)
  ->  Bitmap Index Scan on idx_transactions_block_height
        Index Cond: (block_height BETWEEN 18500000 AND 18510000)

The bitmap index scan builds an in-memory bitmap of matching heap pages. The bitmap heap scan then reads those pages in heap order (reducing random I/O). When the bitmap is too large for work_mem, it becomes lossy — the Recheck Cond re-evaluates the predicate at the heap level.

Index Only Scan

No heap access — all data comes from the index leaf pages.

Index Only Scan using idx_transactions_composite on transactions
  (cost=0.57..4.59 rows=1 width=24) (actual time=0.018..0.019 rows=1 loops=1)
  Index Cond: (block_height = 18500050)
  Heap Fetches: 0

Heap Fetches: 0 — perfect. VM bits are set, no heap reads needed.

Join Strategies

Joins are where plan complexity explodes. For N tables, there are N! possible join orders, and each join can use one of three strategies.

Nested Loop Join

For each row in the outer relation, scan the inner relation for matching rows.

Nested Loop  (cost=0.57..16.62 rows=5 width=200)
  ->  Index Scan on blocks b  (cost=0.42..8.44 rows=1 width=52)
        Index Cond: (b.height = 18500050)
  ->  Index Scan on transactions t  (cost=0.57..8.12 rows=5 width=148)
        Index Cond: (t.block_height = b.height)

Cost: O(outer_rows × inner_lookup_cost).

Nested loop is efficient when:

The outer relation is small (few rows to iterate)
The inner relation has an index on the join key (cheap inner lookup)
The result set is small

It is catastrophic when:

The outer relation is large
The inner relation has no index (degenerates to O(outer × inner) full scans)

Hash Join

Build a hash table from the smaller relation, probe it with every row from the larger relation.

Hash Join  (cost=2850.00..18450.00 rows=500000 width=200)
  Hash Cond: (t.block_height = b.height)
  ->  Seq Scan on transactions t  (cost=0.00..285432.00 rows=10000000 width=148)
  ->  Hash  (cost=1823.00..1823.00 rows=82240 width=52)
        Buckets: 131072  Batches: 1  Memory Usage: 6432kB
        ->  Seq Scan on blocks b  (cost=0.00..1823.00 rows=82240 width=52)

Memory Usage: 6432kB — the hash table fits in memory (within work_mem). When it doesn't fit, Batches > 1 means the hash table is written to disk in multiple passes — significantly slower.

Hash  (cost=...)
  Buckets: 32768  Batches: 4  Memory Usage: 4096kB
-- Batches: 4 means the hash spilled to disk 4 times — increase work_mem

Hash join is efficient when:

One relation is small enough to hash in memory
The join key has high cardinality (few hash collisions)
No index is available for a nested loop

Merge Join

Both relations are sorted on the join key; the sorted streams are merged.

Merge Join  (cost=0.99..125432.00 rows=500000 width=200)
  Merge Cond: (t.block_height = b.height)
  ->  Index Scan using idx_transactions_block_height on transactions t
  ->  Index Scan using idx_blocks_height on blocks b

Merge join is efficient when:

Both relations have indexes that produce sorted output (zero sort cost)
The join is on the full equality condition
Both result sets are large (hash join's memory limit is a concern)

When both indexes exist, merge join can be extremely fast — it reads both sorted streams linearly with no random access.

Join Order and `join_collapse_limit`

For queries joining N tables, the planner evaluates join orders up to join_collapse_limit (default: 8) tables exhaustively. Beyond that, it uses a greedy heuristic.

sql
-- Planner evaluates all join orders for ≤8 tables
-- For 9+ tables it uses GEQO (genetic query optimizer)
SET join_collapse_limit = 1;  -- respect the explicit JOIN order in the query
SET join_collapse_limit = 8;  -- default: planner chooses order

For complex reporting queries with many joins, the planner may choose a suboptimal order. You can hint the order by writing CTEs (which pre-materialize) or by using SET join_collapse_limit = 1 with explicit join order.

Aggregation Strategies

HashAggregate

Build a hash table keyed by GROUP BY columns, accumulate aggregate values.

HashAggregate  (cost=35432.00..35532.00 rows=100 width=16)
  Group Key: status
  Batches: 1  Memory Usage: 40kB
  ->  Seq Scan on transactions  (cost=0.00..285432.00 rows=10000000 width=4)

Batches: 1 = hash table fits in memory. Good. Batches > 1 = hash table spilled to disk. Increase work_mem.

GroupAggregate

Requires sorted input, aggregates by scanning sorted groups.

GroupAggregate  (cost=0.57..18234.00 rows=3 width=16)
  Group Key: status
  ->  Index Scan using idx_transactions_status on transactions

GroupAggregate uses less memory (no hash table) but requires sorted input. If an index provides the sort, GroupAggregate can be cheaper. If a sort node is needed, HashAggregate is usually better.

Reading EXPLAIN ANALYZE Output: A Complete Walkthrough

sql
EXPLAIN (ANALYZE, BUFFERS, WAL, FORMAT TEXT)
SELECT
  b.height,
  count(t.id) AS tx_count,
  sum(t.amount) AS total_amount
FROM blocks b
JOIN transactions t ON t.block_height = b.height
WHERE b.height BETWEEN 18500000 AND 18500100
  AND t.status = 'confirmed'
GROUP BY b.height
ORDER BY b.height;

Sort  (cost=1823.44..1824.19 rows=101 width=24)
      (actual time=18.432..18.445 rows=101 loops=1)
  Sort Key: b.height
  Sort Method: quicksort  Memory: 33kB
  Buffers: shared hit=892 read=47
  ->  HashAggregate  (cost=1816.93..1818.43 rows=101 width=24)
                     (actual time=18.312..18.389 rows=101 loops=1)
        Group Key: b.height
        Batches: 1  Memory Usage: 56kB
        Buffers: shared hit=892 read=47
        ->  Hash Join  (cost=8.44..1814.43 rows=500 width=16)
                       (actual time=0.312..17.832 rows=5050 loops=1)
              Hash Cond: (t.block_height = b.height)
              Buffers: shared hit=892 read=47
              ->  Bitmap Heap Scan on transactions t
                    (cost=5.19..1782.12 rows=5000 width=16)
                    (actual time=0.089..16.432 rows=5050 loops=1)
                    Recheck Cond: (status = 'confirmed')
                    Filter: (block_height BETWEEN 18500000 AND 18500100)
                    Rows Removed by Filter: 892
                    Heap Blocks: exact=847
                    Buffers: shared hit=850 read=47
                    ->  Bitmap Index Scan on idx_transactions_pending
                          (cost=0.00..3.94 rows=5942 width=0)
                          (actual time=0.052..0.053 rows=5942 loops=1)
                          Index Cond: (status = 'confirmed')
                          Buffers: shared hit=42
              ->  Hash  (cost=2.01..2.01 rows=101 width=8)
                        (actual time=0.198..0.199 rows=101 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 12kB
                    ->  Index Scan using idx_blocks_height on blocks b
                          (cost=0.42..2.01 rows=101 width=8)
                          (actual time=0.011..0.141 rows=101 loops=1)
                          Index Cond: (height BETWEEN 18500000 AND 18500100)
                          Buffers: shared hit=42
Planning Time: 1.832 ms
Execution Time: 18.621 ms

Reading this tree bottom-up:

Index Scan on blocks — reads 101 blocks for the height range. 42 buffer hits, 0 disk reads. Fast.
Hash — builds a hash table from 101 blocks. 12kB, fits easily in memory.
Bitmap Index Scan on idx_transactions_pending — scans the status index for 'confirmed' rows. Returns 5,942 TIDs. Notice this uses a partial index named _pending but scans for 'confirmed' — the index name is misleading, it's actually the full status index. 42 buffer hits.
Bitmap Heap Scan on transactions — fetches 5,942 TIDs from the heap. Rows Removed by Filter: 892 — 892 rows matched the status index but failed the block_height range filter. 847 heap blocks read, 47 from disk (not in shared_buffers).
Hash Join — probes the blocks hash table with each transaction row. 5,050 rows matched.
HashAggregate — groups by b.height, computes count and sum. 56kB memory, no spill.
Sort — sorts the 101 group results by height. 33kB quicksort in memory.

What to check in this output:

rows=5000 estimated vs rows=5050 actual — excellent estimate (1% error)
Batches: 1 for both hash nodes — no disk spill
read=47 in the heap scan — 47 pages not in shared_buffers. Acceptable for this query size
Rows Removed by Filter: 892 — 892 rows matched the index but failed the range filter. A composite index (status, block_height) would eliminate these heap fetches

Diagnosing Plan Regressions

The Estimate vs Actual Gap

The most reliable signal of a plan problem is a large gap between estimated and actual row counts.

Seq Scan on transactions  (cost=0.00..285432.00 rows=10 width=148)
                          (actual time=0.100..4832.100 rows=9500000 loops=1)

The planner estimated 10 rows. 9.5 million actual rows were scanned. Every downstream plan decision was made assuming 10 rows — join strategy, aggregation strategy, memory allocation — all wrong.

Root causes:

Stale statistics — run ANALYZE
Statistics target too low — increase statistics_target for the column
Multi-column correlation — create extended statistics
Complex function in predicate — planner has no statistics on function output

Stale Statistics After Bulk Load

sql
-- Pattern: query was fast, bulk load happened, query is now slow
-- EXPLAIN shows planner estimated X rows, actual is 100X

-- Fix: always ANALYZE after bulk loads
INSERT INTO transactions SELECT ... FROM staging;
ANALYZE transactions;  -- before any queries run

Planner Choosing Sequential Scan Over Available Index

sql
EXPLAIN SELECT * FROM transactions WHERE sender = '0xabc...';
-- Seq Scan on transactions  (cost=0.00..285432.00 rows=2500000 width=148)
-- Filter: (sender = '0xabc...')

If an index on sender exists and isn't being used:

Check selectivity — SELECT count(*) FROM transactions WHERE sender = '0xabc...'. If it's 25% of the table, sequential scan is correct.
Check random_page_cost — if it's 4.0 on SSD, reduce to 1.5.
Check statistics — SELECT most_common_freqs FROM pg_stats WHERE attname = 'sender'. Is the estimate accurate?
Check correlation — SELECT correlation FROM pg_stats WHERE attname = 'sender'. Low correlation = index scan fetches scattered heap pages.

sql
-- Force index use for debugging (never in production)
SET enable_seqscan = off;
EXPLAIN SELECT * FROM transactions WHERE sender = '0xabc...';
-- Now shows index plan and its estimated cost
-- Compare to actual cost when forced

Work_mem and Sort/Hash Spills

sql
-- Check for disk spills
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM transactions ORDER BY timestamp;

-- Look for:
-- Sort Method: external merge  Disk: 524288kB  ← sorting to disk
-- Batches: 8  ← hash join/aggregate spilling

-- Fix:
SET work_mem = '256MB';  -- session-level increase for specific queries
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM transactions ORDER BY timestamp;
-- Sort Method: quicksort  Memory: 182340kB  ← now in memory

work_mem is per-sort-operation per-backend. Increasing it globally risks OOM on concurrent workloads. Increase it session-level for known expensive queries, or use a connection pool that sets it per query type.

Parallel Query

Postgres supports parallel execution for sequential scans, aggregations, joins, and index scans (Postgres 10+).

Gather  (cost=1000.00..145432.00 rows=10000000 width=148)
  Workers Planned: 4
  Workers Launched: 4
  ->  Parallel Seq Scan on transactions
        (cost=0.00..144432.00 rows=2500000 width=148)
        Workers: 4

Workers Planned: 4 but Workers Launched: 4 — all workers started. If launched < planned, check max_parallel_workers and max_parallel_workers_per_gather.

Parallel Query Tuning

ini
# Maximum parallel workers across the whole server
max_parallel_workers = 8

# Maximum workers for a single gather node
max_parallel_workers_per_gather = 4

# Minimum table size before considering parallel scan
min_parallel_table_scan_size = 8MB

# Cost reduction for parallel plans (higher = more likely to parallelize)
parallel_tuple_cost = 0.1
parallel_setup_cost = 1000.0

Parallel query helps for large sequential scans and aggregations. It does not help for index scans on OLTP queries (they're already fast). Enabling too much parallelism on an OLTP workload can hurt throughput by exhausting workers needed for concurrent connections.

Plan Caching and Prepared Statements

Postgres caches execution plans for prepared statements — but the caching behavior changed significantly in PG 12.

Generic vs Custom Plans

For the first 5 executions of a prepared statement, Postgres uses a custom plan — re-planned with the actual parameter values.

After 5 executions, Postgres compares the average cost of custom plans to the estimated cost of a generic plan (planned without seeing specific parameter values). If the generic plan is not significantly worse, it caches and reuses the generic plan.

sql
PREPARE get_txns (varchar) AS
  SELECT * FROM transactions WHERE sender = $1 LIMIT 100;

-- Executions 1-5: custom plans with actual sender value
EXECUTE get_txns('0xabc...');

-- Execution 6+: possibly switches to generic plan
-- Generic plan for sender: might choose seq scan (sender has low cardinality globally)
-- But for your specific '0xabc...', an index scan would be better

-- Check which plan type is being used:
EXPLAIN EXECUTE get_txns('0xabc...');

When a generic plan is wrong for certain parameter values, you can force custom plans:

sql
-- Force custom plans always (re-plan every execution)
SET plan_cache_mode = force_custom_plan;

-- Force generic plan always (dangerous if parameters vary widely)
SET plan_cache_mode = force_generic_plan;

-- Default: automatic switching
SET plan_cache_mode = auto;

In PgBouncer session mode, prepared statements are visible to Postgres and cached. In transaction mode, prepared statements are not preserved — every execution goes through the 5-run warm-up again.

The `pg_stat_statements` Extension

pg_stat_statements is the most important diagnostic extension for query performance. It tracks statistics for every distinct query shape executed.

sql
CREATE EXTENSION pg_stat_statements;

-- Top 10 queries by total execution time
SELECT
  LEFT(query, 80) AS query,
  calls,
  ROUND(total_exec_time::numeric, 2) AS total_ms,
  ROUND(mean_exec_time::numeric, 2) AS mean_ms,
  ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
  rows,
  shared_blks_hit,
  shared_blks_read,
  blk_read_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

sql
-- Queries with high cache miss rate (blks_read / total blks)
SELECT
  LEFT(query, 80) AS query,
  calls,
  shared_blks_hit + shared_blks_read AS total_blks,
  ROUND(100.0 * shared_blks_read /
    NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS cache_miss_pct,
  ROUND(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY cache_miss_pct DESC
LIMIT 10;

sql
-- Queries with high row-estimation error (requires pg_stat_statements v1.9+)
SELECT
  LEFT(query, 80) AS query,
  calls,
  ROUND(mean_exec_time::numeric, 2) AS mean_ms,
  rows / NULLIF(calls, 0) AS avg_actual_rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

Reset stats after schema changes or tuning to get a clean baseline:

sql
SELECT pg_stat_statements_reset();

Production Incident: Plan Regression After Partition Pruning Failure

Context: A blockchain indexer partitioned the transactions table by block_height ranges, each partition covering 500,000 blocks.

The query:

sql
SELECT * FROM transactions
WHERE block_height = 18500050
  AND sender = '0xabc...';

What happened:

After a schema refactor, a developer changed the query to pass block_height as a string parameter from the application layer:

sql
-- Application was sending:
WHERE block_height = '18500050'  -- string, not integer

Postgres coerced the string to bigint for comparison — but the coercion prevented partition pruning. Instead of scanning only the transactions_18500000_19000000 partition, the planner scanned all 36 partitions.

Before the change: actual time=0.8ms, rows=3 After the change: actual time=14200ms, rows=3

sql
-- EXPLAIN showed:
Append
  ->  Seq Scan on transactions_17500000_18000000
        Filter: (block_height = '18500050'::bigint)
        Rows Removed by Filter: 500000
  ->  Seq Scan on transactions_18000000_18500000
  -- ... 34 more partitions

The fix:

sql
-- Fix 1: cast explicitly in application (correct type)
WHERE block_height = 18500050::bigint

-- Fix 2: verify partition pruning is working
EXPLAIN SELECT * FROM transactions WHERE block_height = 18500050;
-- Should show: Seq Scan on transactions_18500000_19000000 (single partition)
-- Not: Append with all partitions

The lesson: partition pruning requires the predicate type to match the partition key type exactly. Type coercions — even implicit ones — can silently disable pruning and turn a 1ms query into a 14-second full table scan.

Summary

Concept	Key Takeaway
Cost model	Arbitrary units; `random_page_cost = 4.0` is wrong for SSD — set to 1.1–1.5
Statistics	Planner estimates are only as good as `pg_stats`. Run `ANALYZE` after bulk loads.
`statistics_target`	Increase for high-cardinality columns with bad cardinality estimates
Extended statistics	Capture multi-column correlations with `CREATE STATISTICS`
Estimate vs actual	Large divergence = root cause of every plan regression
Nested loop	Fast when outer is small and inner has index; catastrophic otherwise
Hash join	Best for large joins; spills to disk when `work_mem` is too small
Merge join	Optimal when both sides have sorted index output
`work_mem`	Per-sort-operation; spills cause `external merge` / `Batches > 1`
Parallel query	Useful for large scans; configure `max_parallel_workers_per_gather`
Plan cache	Generic plans can be wrong for skewed parameters; `plan_cache_mode` controls this
`pg_stat_statements`	The first extension you install on any production database

Understanding how the planner makes decisions is only half the picture. Module 7 covers the schema design decisions made before the planner even runs — the ones that are nearly impossible to reverse at scale.

Next: Module 7 — Schema Design at Scale: Decisions That Cannot Be Undone →

PreviousModule 5: Indexes: B-Tree Internals, GIN, GiST, and When Each One Hurts You Next Module 7: Schema Design at Scale: Decisions That Cannot Be Undone