Module 5·49 min read

An index is not a free performance boost. Every index has a write cost, a bloat trajectory, and a planner interaction that can go wrong.

Module 5 — Indexes: B-Tree Internals, GIN, GiST, and When Each One Hurts You

What this module covers: An index is not a free performance boost. Every index has a write cost that compounds with every INSERT, UPDATE, and DELETE. Every index bloats over time. Every index has a specific structure that makes it fast for some query patterns and useless — or actively harmful — for others. This module covers the internal mechanics of every major Postgres index type, how the planner decides whether to use them, and the discipline of indexing for write-heavy production systems.

The Fundamental Trade-off

Before any index mechanics: the decision to add an index is always a trade-off between read speed and write cost.

Every index on a table is an additional data structure that must be kept consistent with the heap. When you insert a row, Postgres inserts into the heap and into every index. When you update a row, Postgres updates every index whose columns changed. When you delete a row, Postgres marks the row dead in the heap and in every index.

On a table with 8 indexes:

An INSERT writes to 9 locations (1 heap + 8 indexes)
An UPDATE that changes 3 indexed columns writes to 7 locations (1 heap old + 1 heap new + 5 unchanged index entries + 2 old index deletes + 2 new index inserts — and WAL for all of it)
A DELETE marks dead in 9 locations

This is before considering the WAL generated for each operation. With full page writes after each checkpoint (Module 3), each of those 9 page modifications can generate 8KB of WAL on first write post-checkpoint.

The discipline of indexing: add indexes for queries that are hot enough to justify the write cost on every insert/update/delete. Remove indexes that are not being used. Be precise about which columns need indexing and in which order.

B-Tree Index Internals

The B-tree is Postgres's default index type. CREATE INDEX without specifying a type creates a B-tree. Understanding it at the page level makes every other B-tree behavior — splits, bloat, planner decisions, ordering — obvious.

The B-Tree Structure

A B-tree index is a balanced tree of fixed-size 8KB pages. There are three kinds of pages:

Meta page (page 0): Contains the root page pointer and fast-root pointer. The root is where all searches start.

Internal pages (branch nodes): Contain key-pointer pairs. Each entry holds an index key value and a pointer to the child page where values ≤ that key live. Internal pages do not contain heap TIDs — they are navigation only.

Leaf pages: Contain index entries, each being a key value + heap TID (page number + offset). Leaf pages are linked in a doubly-linked list in sorted order — this is what makes range scans efficient. You find the first matching leaf entry, then follow the right-sibling pointer without traversing the tree again.

sql
-- Inspect a B-tree's structure using pageinspect
CREATE EXTENSION pageinspect;

-- See the meta page
SELECT * FROM bt_metap('idx_transactions_block_height');
-- root | level | fastroot | ...
-- 412  |   3   |   412   | ...  ← 3 levels deep for a large index

-- See internal page contents
SELECT itemoffset, ctid, itemlen, nulls, vars, data
FROM bt_page_items('idx_transactions_block_height', 412)
LIMIT 10;

B-Tree Searches

A point lookup (e.g., WHERE block_height = 18500050) traverses from root to leaf:

Read root page → find the internal entry where 18500050 falls → get child pointer
Read internal page → find the next child pointer
Read leaf page → find entries with block_height = 18500050, extract TIDs
For each TID: heap fetch (check visibility, return row)

For a 3-level tree with 100M rows, this is 3 page reads + 1 heap read = 4 random I/Os per row. With shared_buffers warmed up, those pages are in cache and the lookup is pure CPU.

Range Scans and the Leaf Chain

For WHERE block_height BETWEEN 18500000 AND 18500100:

Tree traversal to find the first leaf entry ≥ 18500000
Scan right along the leaf chain, collecting TIDs
Sort TIDs (or not, depending on the access pattern)
Fetch each heap page

The leaf chain scan is sequential — efficient. The heap fetches are random — potentially expensive if the matching rows are spread across many heap pages. This is where correlation matters (covered below).

Page Splits: The Source of B-Tree Bloat

When a leaf page fills up, Postgres splits it: the existing entries are divided between the current page and a new page, and a pointer to the new page is inserted into the parent. If the parent is also full, it splits too — cascading upward.

The right-growth optimization: When inserts are strictly sequential (e.g., a BIGSERIAL primary key), Postgres detects this and always splits to the right — new pages are appended rather than split. This produces densely-packed pages with no wasted space. Sequential primary keys are cheaper to maintain in indexes than random ones.

Random inserts cause 50% fill factor after splits. By default, Postgres splits a full page roughly in half. After the split, both pages are ~50% full. On a random-insert workload, pages frequently hover between 50% and 100% full — wasting up to 50% of index storage.

Setting a lower fillfactor leaves room for updates without splits:

sql
-- Create index with 70% fill factor — 30% reserved for in-page updates
CREATE INDEX idx_transactions_status
ON transactions (status)
WITH (fillfactor = 70);

For indexes on frequently-updated columns, a lower fillfactor reduces page splits at the cost of larger index size. For append-only tables, use the default (90) or even higher.

HOT Updates and Index Overhead

Heap Only Tuple (HOT) updates are a critical optimization. When you update a row and the updated columns are NOT indexed, Postgres can avoid updating any indexes:

Old tuple is marked dead in the heap
New tuple is written to the same heap page (if space allows)
The old tuple's ctid pointer chains to the new tuple
No index entries are modified

HOT updates generate significantly less WAL and avoid index bloat. They only work when:

The updated columns are not part of any index
The new tuple fits on the same heap page as the old tuple

sql
-- Check HOT update rate for a table
SELECT
  relname,
  n_tup_upd,
  n_tup_hot_upd,
  ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'transactions';

-- Low hot_pct = most updates are touching indexed columns
-- High hot_pct = good — HOT is working, index maintenance is minimal

If hot_pct is low on a high-update table, audit which columns are being updated versus which are indexed. Sometimes removing an index on a column that is updated frequently (but not queried with it) dramatically improves write throughput.

Index Correlation

Correlation measures how well the physical order of rows in the heap matches the logical order of values in an index. It ranges from -1 (perfectly reversed) to +1 (perfectly aligned).

sql
-- Check correlation for indexed columns
SELECT
  attname,
  correlation
FROM pg_stats
WHERE tablename = 'transactions'
  AND attname IN ('id', 'block_height', 'timestamp', 'sender');

-- Example output:
-- id            | 0.9998  ← nearly perfect sequential, index scans are cheap
-- block_height  | 0.8342  ← good correlation, range scans efficient
-- timestamp     | -0.0023 ← random order, range scans cause many heap fetches
-- sender        | 0.0011  ← random, each index hit = separate heap page read

When correlation is low, an index scan fetches rows from scattered heap pages — potentially one random I/O per row. For large result sets, the planner may choose a sequential scan instead, because reading heap pages sequentially is more efficient than random fetches.

This is why CLUSTER (which rewrites the table in index order) can dramatically speed up range queries on low-correlation columns — but it's a full table rewrite and acquires an exclusive lock.

Multi-Column Indexes

A multi-column (composite) index on (a, b) stores entries sorted by a first, then b within each a value.

sql
CREATE INDEX idx_transactions_height_status
ON transactions (block_height, status);

This index efficiently supports:

WHERE block_height = X — uses the index (leftmost column)
WHERE block_height = X AND status = Y — uses both columns
WHERE block_height BETWEEN X AND Y — range on leading column
ORDER BY block_height, status — sorted output

It does not efficiently support:

WHERE status = Y — cannot use the index without the leading column
ORDER BY status — not sorted by trailing column alone

The leading column rule: The index is only usable when the query predicates include the leftmost column(s). A query on status alone cannot use idx_transactions_height_status.

Column order matters for range queries: Put equality predicates first, range predicates last.

sql
-- Query: WHERE block_height BETWEEN X AND Y AND status = 'confirmed'
-- Best index order: (status, block_height) — equality first, range second
CREATE INDEX idx_transactions_status_height
ON transactions (status, block_height);

-- With (block_height, status): scans entire height range, filters status
-- With (status, block_height): scans only 'confirmed' rows in height range

Partial Indexes

A partial index indexes only the rows that match a WHERE condition. This is one of the most underused features in Postgres.

sql
-- Instead of indexing all 10M transactions, index only pending ones
-- If 95% are 'confirmed', this is ~50x smaller than a full index
CREATE INDEX idx_transactions_pending
ON transactions (created_at)
WHERE status = 'pending';

-- The planner uses this index when the query includes the predicate
SELECT * FROM transactions
WHERE status = 'pending'
  AND created_at < now() - INTERVAL '10 minutes';
-- → uses idx_transactions_pending

Partial indexes:

Are smaller (fewer entries → fewer pages → more cache hits)
Have lower write cost (only insert/delete when the condition is met)
Can be more selective (same selectivity over a smaller row set)

The pattern for status-filtered tables: If 95% of rows are in a terminal state (confirmed, failed) and 5% are active (pending), a partial index on active rows is 20x smaller than a full index and much more cache-friendly.

sql
-- Unique constraint only on active rows (allow multiple 'completed' per user)
CREATE UNIQUE INDEX idx_active_session_per_user
ON sessions (user_id)
WHERE status = 'active';

Expression Indexes

An expression index indexes the result of a function or expression, not a raw column value.

sql
-- Case-insensitive email lookups
CREATE INDEX idx_users_email_lower
ON users (lower(email));

-- The planner uses this when the query uses the same expression
SELECT * FROM users WHERE lower(email) = lower('User@Example.com');

-- Without the expression index, this query does a full scan even with
-- an index on email (because lower(email) ≠ email for the optimizer)

sql
-- Index on a JSONB field without a GIN index
CREATE INDEX idx_transactions_gas_used
ON transactions ((payload->>'gas_used')::bigint);

SELECT * FROM transactions
WHERE (payload->>'gas_used')::bigint > 100000;

Expression indexes increase write cost because Postgres must evaluate the expression for every inserted/updated row and store the result in the index.

GIN Indexes: For Containment Queries

Generalized Inverted Index (GIN) is designed for data types where a single column value contains multiple "elements" — arrays, JSONB documents, tsvectors for full-text search.

How GIN Works

A GIN index maintains a posting list structure:

Key: each distinct element (array element, JSONB key, lexeme)
Value: the list of heap TIDs that contain this key

For a JSONB column, GIN indexes every key and value in every document. For an array column, it indexes every array element.

sql
-- GIN on JSONB for containment queries
CREATE INDEX idx_transactions_payload_gin
ON transactions USING GIN (payload);

-- Now these queries use the index:
SELECT * FROM transactions WHERE payload @> '{"type": "Transfer"}';
SELECT * FROM transactions WHERE payload ? 'gas_used';
SELECT * FROM transactions WHERE payload @> '{"events": [{"type": "Transfer"}]}';

GIN Operators

Operator	Meaning	Example
`@>`	Contains	`payload @> '{"type": "Transfer"}'`
`<@`	Contained by	`'{"a":1}' <@ payload`
`?`	Key exists	`payload ? 'gas_used'`
`?	`	Any key exists
`?&`	All keys exist	`payload ?& ARRAY['a','b']`
`@@`	Full-text match	`to_tsvector(body) @@ plainto_tsquery('search')`

GIN Write Cost: The Pending List

GIN indexes have a significantly higher write cost than B-tree because inserting one row requires updating the posting list for every element in the document. A JSONB document with 20 keys generates 20+ index updates per INSERT.

To mitigate this, GIN uses a pending list: new entries are first written to a small pending list on disk, and then merged into the main GIN structure in bulk during VACUUM or when the pending list fills gin_pending_list_limit (default 4MB).

ini
gin_pending_list_limit = 4MB   # how large the pending list can grow before forced merge

This pending list is great for write throughput but means GIN indexes may not reflect very recent changes until the pending list is merged. For most use cases this is fine.

sql
-- Force pending list merge (useful after bulk loads)
SELECT gin_clean_pending_list('idx_transactions_payload_gin');

jsonb_path_ops vs jsonb_ops

sql
-- Default: indexes all keys AND values (larger, more flexible)
CREATE INDEX ON transactions USING GIN (payload);           -- jsonb_ops

-- Optimized: indexes only values for containment (@>) queries (smaller, faster)
CREATE INDEX ON transactions USING GIN (payload jsonb_path_ops);

-- jsonb_path_ops does NOT support ? (key exists) queries
-- Use jsonb_ops if you need both containment and key-exists
-- Use jsonb_path_ops if you only need @>

jsonb_path_ops produces a smaller index (indexes only paths, not individual keys) and is faster for @> queries. The trade-off: it does not support ?, ?|, ?& operators.

GiST Indexes: For Geometric and Range Queries

Generalized Search Tree (GiST) is an extensible index framework that supports spatial queries, range type overlap, and nearest-neighbor searches. Unlike B-tree (exact ordering) and GIN (containment), GiST supports geometric predicates.

sql
-- Range type overlap
CREATE TABLE bookings (
  id BIGSERIAL,
  room_id INT,
  reserved_during TSTZRANGE
);

CREATE INDEX idx_bookings_range ON bookings USING GiST (reserved_during);

-- Find overlapping bookings
SELECT * FROM bookings
WHERE reserved_during && '[2026-05-17, 2026-05-20)';

-- Exclusion constraint using GiST (prevents overlapping bookings for same room)
ALTER TABLE bookings ADD CONSTRAINT no_overlap
EXCLUDE USING GiST (room_id WITH =, reserved_during WITH &&);

GiST for Full-Text Search

GiST can also index tsvector (full-text search vectors), but GIN is almost always preferred for full-text — GIN is faster for lookup, GiST is faster for writes.

sql
-- GiST for full-text (use when write speed matters more than read speed)
CREATE INDEX ON documents USING GiST (to_tsvector('english', body));

-- GIN for full-text (use when read speed matters more)
CREATE INDEX ON documents USING GIN (to_tsvector('english', body));

GiST for Nearest-Neighbor (KNN)

GiST supports KNN searches using the <-> distance operator:

sql
-- PostGIS example: find nearest locations
CREATE INDEX ON locations USING GiST (geom);

SELECT name, geom <-> ST_MakePoint(-122.4194, 37.7749) AS distance
FROM locations
ORDER BY geom <-> ST_MakePoint(-122.4194, 37.7749)
LIMIT 10;
-- Uses GiST index for KNN — efficient without scanning all rows

BRIN Indexes: For Naturally Ordered Large Tables

Block Range INdex (BRIN) is fundamentally different from B-tree and GIN. Instead of indexing individual values, it stores the min and max value for each range of heap pages (a "block range", default 128 pages = 1MB).

sql
CREATE INDEX idx_transactions_timestamp_brin
ON transactions USING BRIN (timestamp)
WITH (pages_per_range = 128);

BRIN is useful when the column value is correlated with physical insertion order — i.e., newer rows have later timestamps, and they are inserted at the end of the heap. This is true for:

Auto-incrementing IDs (BIGSERIAL)
Timestamps with append-only insert patterns
Block heights in a blockchain indexer

sql
-- Query benefits from BRIN
SELECT * FROM transactions
WHERE timestamp > now() - INTERVAL '1 day';
-- BRIN eliminates all block ranges where max(timestamp) < threshold
-- For a 10M row table, this might eliminate 90% of pages with a 1KB index

BRIN vs B-tree: When to Choose

	B-Tree	BRIN
Size	Large (proportional to row count)	Tiny (proportional to page count)
Write cost	Per-row update	Per-block-range, lazy
Read speed	Fast for any selectivity	Fast only with high correlation
Random data	Efficient	Useless
Sequential data	Works, but BRIN is cheaper	Ideal

The rule: use BRIN on timestamp or sequential ID columns in append-only tables when cardinality is high and correlation is high. Use B-tree everywhere else that needs fast point lookup.

A BRIN index on a 100M-row table with 128 pages per range is roughly 100M / 128 / (8KB/entry) ≈ 100KB. A B-tree on the same column is hundreds of megabytes. The BRIN query is slower (must scan all pages in matching ranges), but for large time-range filters it eliminates the vast majority of pages.

Hash Indexes

Hash indexes store a hash of the indexed value and are only useful for equality comparisons (=). They are smaller than B-tree for equality workloads but do not support range queries, ordering, or NULL.

sql
CREATE INDEX idx_transactions_hash_hash
ON transactions USING HASH (hash);

-- Only useful for: WHERE hash = '\x...'
-- Useless for: WHERE hash LIKE '\x12%', ORDER BY hash, range queries

In practice, B-tree handles equality lookups nearly as fast as hash indexes, and B-tree is more versatile. Hash indexes are a niche optimization for very high cardinality equality columns (like UUIDs or hashes) where the planner otherwise might not use the index effectively.

Index-Only Scans

When all columns needed by a query are in the index, Postgres can satisfy the query without touching the heap at all — an index-only scan.

sql
-- Index on (block_height, status, timestamp)
CREATE INDEX idx_transactions_composite
ON transactions (block_height, status, timestamp);

-- This query needs only those three columns:
SELECT block_height, status, timestamp
FROM transactions
WHERE block_height BETWEEN 18500000 AND 18500100;
-- → Index Only Scan (no heap fetch)

Index-only scans are fast because they avoid random heap I/O. But they require the Visibility Map to confirm tuple visibility — if the VM bit for a page is not set (page has not been vacuumed recently), Postgres must fetch the heap page anyway to check visibility.

sql
-- See how many heap fetches an index-only scan is actually making
EXPLAIN (ANALYZE, BUFFERS)
SELECT block_height, status, timestamp
FROM transactions
WHERE block_height BETWEEN 18500000 AND 18500100;

-- Look for:
-- Index Only Scan using idx_transactions_composite
-- Heap Fetches: 0   ← perfect — VM bits are set, no heap reads
-- Heap Fetches: 847 ← VM bits not set, autovacuum hasn't run recently

If heap fetches are high on an index-only scan, run VACUUM on the table to set VM bits. After vacuum, heap fetches should drop to zero.

Covering Indexes: `INCLUDE`

Postgres 11+ supports INCLUDE columns — columns stored in the index leaf pages but not part of the sort key. This enables index-only scans without affecting the index sort order.

sql
-- Index sorted by (block_height, status), with timestamp stored but not sorted
CREATE INDEX idx_transactions_height_covering
ON transactions (block_height, status)
INCLUDE (timestamp, sender);

-- Query: needs block_height, status, timestamp, sender
SELECT block_height, status, timestamp, sender
FROM transactions
WHERE block_height = 18500050
  AND status = 'confirmed';
-- → Index Only Scan — all four columns available in index leaf pages

INCLUDE columns increase index size but don't affect the B-tree structure or split behavior. Use them when you have a frequently-used query that needs just a few extra columns beyond the index key.

When Indexes Hurt: The Cases Most Engineers Miss

1. Low-Selectivity Columns

An index on a column with few distinct values (low cardinality) is often useless or counterproductive.

sql
-- status has 3 values: 'confirmed' (95%), 'pending' (4%), 'failed' (1%)
CREATE INDEX idx_transactions_status ON transactions (status);

-- Query for 'confirmed' rows: planner won't use this index
SELECT * FROM transactions WHERE status = 'confirmed';
-- 95% of rows match — sequential scan is faster than 9.5M random heap fetches

The planner estimates that fetching 95% of rows via index is more expensive than a sequential scan. It will choose the sequential scan. The index is wasted disk space and write overhead.

The fix: partial index on the selective values only:

sql
CREATE INDEX idx_transactions_pending ON transactions (created_at) WHERE status = 'pending';
CREATE INDEX idx_transactions_failed ON transactions (created_at) WHERE status = 'failed';

2. Too Many Indexes on a Write-Heavy Table

sql
-- A table with 12 indexes on a blockchain transactions table
\d transactions
-- Indexes:
--  idx_transactions_pkey           (id)
--  idx_transactions_hash           (hash)
--  idx_transactions_block_height   (block_height)
--  idx_transactions_sender         (sender)
--  idx_transactions_recipient      (recipient)
--  idx_transactions_status         (status)
--  idx_transactions_timestamp      (timestamp)
--  idx_transactions_created_at     (created_at)
--  idx_transactions_payload_gin    (payload) GIN
--  idx_transactions_sender_height  (sender, block_height)
--  idx_transactions_height_status  (block_height, status)
--  idx_transactions_covering       (block_height, status) INCLUDE (timestamp)

At 2,000 inserts/second, each insert touches all 12 index structures. That's 24,000 index page modifications per second, each potentially generating WAL. This is the hidden write amplification that only becomes visible when write throughput starts saturating I/O.

Audit unused indexes:

sql
-- Find indexes that have never been used since last stats reset
SELECT
  schemaname,
  relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size,
  idx_scan AS scans_since_reset
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND NOT indisprimary
  AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;

Drop indexes with idx_scan = 0 that have been around long enough to capture representative traffic (reset stats after major traffic pattern changes with SELECT pg_stat_reset()).

3. Indexes That Are Never Used by the Planner

An index exists but the planner refuses to use it — not because it's wrong, but because the cost estimate says sequential scan is cheaper.

sql
EXPLAIN SELECT * FROM transactions WHERE block_height > 18000000;
-- Seq Scan on transactions (cost=0.00..285432.00 rows=9500000 ...)
-- Filter: (block_height > 18000000)

If block_height > 18000000 matches 95% of rows, the planner is correct — sequential scan is faster. The index exists but is never used. It costs write overhead on every insert with no read benefit.

Force the planner to show you what it thinks:

sql
-- Check estimated selectivity
SELECT
  n_distinct,
  correlation,
  most_common_vals,
  most_common_freqs,
  histogram_bounds
FROM pg_stats
WHERE tablename = 'transactions'
  AND attname = 'block_height';

4. Indexes That Break HOT Updates

Any index on a column that is frequently updated prevents HOT updates on those updates, forcing full index maintenance.

sql
-- If 'last_seen_at' is updated every time a user logs in:
CREATE INDEX idx_users_last_seen ON users (last_seen_at);

-- Every login now does a full index delete + insert instead of HOT update
-- Remove this index if 'last_seen_at' is not used in WHERE/ORDER BY queries

5. Unused Unique Indexes Beyond the Constraint

Unique indexes are often added for constraint enforcement, which is valid. But sometimes engineers add unique indexes "just in case" without realizing the constraint enforcement itself already creates an index:

sql
-- This creates an implicit unique index:
ALTER TABLE transactions ADD CONSTRAINT transactions_hash_key UNIQUE (hash);

-- Adding this is redundant — same index already exists:
CREATE UNIQUE INDEX idx_transactions_hash ON transactions (hash); -- DUPLICATE

Check for duplicate indexes (same columns, same order) and drop the redundant ones.

The Planner and Indexes: When Good Indexes Are Ignored

Statistics Freshness

The planner decides whether to use an index based on statistics. If statistics are stale (bulk load without ANALYZE, or autovacuum hasn't run), the planner may misestimate selectivity and choose the wrong plan.

sql
-- After any bulk load or major data change:
ANALYZE transactions;

-- Check when statistics were last updated:
SELECT
  attname,
  n_distinct,
  correlation,
  most_common_vals
FROM pg_stats
WHERE tablename = 'transactions';

`enable_indexscan` and Plan Forcing

sql
-- Force the planner to use (or avoid) index scans for debugging
SET enable_indexscan = off;       -- disable index scans
SET enable_bitmapscan = off;      -- disable bitmap index scans
SET enable_seqscan = off;         -- disable sequential scans (forces index use)

-- Never use these in production — diagnostic only

Bitmap Index Scans

When a query matches many rows via an index, Postgres may use a bitmap index scan instead of a regular index scan:

Scan the index, building a bitmap of matching heap page numbers (not individual TIDs)
Sort the page numbers
Read heap pages in sorted order (sequential-ish access pattern)

This is more efficient than random-order index scan when many rows match. It trades the precise TID ordering for coarser page-level ordering, which reduces random I/O.

sql
EXPLAIN SELECT * FROM transactions WHERE block_height BETWEEN 18500000 AND 18510000;
-- Bitmap Heap Scan on transactions
--   Recheck Cond: (block_height BETWEEN 18500000 AND 18510000)
--   ->  Bitmap Index Scan on idx_transactions_block_height
--         Index Cond: (block_height BETWEEN 18500000 AND 18510000)

Multiple bitmap index scans can be ANDed or ORed together:

sql
EXPLAIN SELECT * FROM transactions
WHERE block_height BETWEEN 18500000 AND 18510000
  AND status = 'pending';
-- BitmapAnd
--   Bitmap Index Scan on idx_transactions_block_height
--   Bitmap Index Scan on idx_transactions_pending

Production Index Audit Workflow

sql
-- 1. Find large unused indexes
SELECT
  relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
JOIN pg_indexes ON indexrelname = indexname AND relname = tablename
WHERE idx_scan < 100
  AND pg_relation_size(indexrelid) > 10000000
ORDER BY pg_relation_size(indexrelid) DESC;

-- 2. Find duplicate indexes (same column set)
SELECT
  indrelid::regclass AS table,
  array_agg(indexrelid::regclass) AS indexes,
  array_agg(indkey) AS column_sets
FROM pg_index
GROUP BY indrelid, indkey
HAVING count(*) > 1;

-- 3. Check index bloat
SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size,
  idx_scan
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- 4. Find missing indexes (tables with sequential scans on large tables)
SELECT
  relname,
  seq_scan,
  seq_tup_read,
  idx_scan,
  n_live_tup,
  pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
  AND n_live_tup > 100000
ORDER BY seq_tup_read DESC
LIMIT 20;

Production Incident: Index Bloat Causing 3x Query Regression

Context: A transactions table with a B-tree index on (sender, block_height) used for API queries like "show me all transactions from sender X in the last 10,000 blocks."

What happened:

The table processed roughly 500 updates/second to the status column. These updates were NOT HOT updates because an index on status existed from an earlier investigation. The index on (sender, block_height) was fine — but each non-HOT update to status caused a full index page access in the sender index as well, because the new tuple needed a new entry even though sender hadn't changed.

After 6 months, the (sender, block_height) index had grown to 8GB from its original 2GB. Average leaf page density had dropped to 45% from the constant churn of dead tuple entries.

sql
-- Diagnosis
SELECT * FROM pgstatindex('idx_transactions_sender_height');
-- leaf_pages: 1048576
-- avg_leaf_density: 44.8
-- leaf_fragmentation: 38.2

The query WHERE sender = '0xabc...' AND block_height > 18400000 now scanned 2x as many leaf pages as necessary, because each page was only 45% full. Query time went from 2ms to 6ms — imperceptible in isolation, but 3x slower and visible in p99 latency.

The fix:

sql
-- Step 1: Remove the unused status index (cause of non-HOT updates)
DROP INDEX idx_transactions_status;

-- Step 2: Reindex without locking (Postgres 12+)
REINDEX INDEX CONCURRENTLY idx_transactions_sender_height;

-- Step 3: Verify HOT update rate improved
SELECT
  n_tup_upd,
  n_tup_hot_upd,
  ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'transactions';
-- hot_pct went from 12% to 94% after dropping the status index

After the reindex, query time returned to 2ms. The REINDEX INDEX CONCURRENTLY took 40 minutes on the 8GB index but did not lock reads or writes during the rebuild.

Summary

Index Type	Best For	Avoid When
B-tree	Equality, range, ordering, most queries	Very low cardinality columns
GIN	Array containment, JSONB `@>`, full-text	Write-heavy columns (high write cost)
GiST	Ranges, spatial data, KNN	Simple equality or range on scalars
BRIN	Sequential, append-only timestamp/ID columns	Low-correlation or random data
Hash	High-cardinality equality only	Anything needing range or ordering
Partial	Selective subsets of rows	When most rows match the condition
Expression	Function-based lookups	Frequently updated expressions
Covering (INCLUDE)	Index-only scan for extra columns	When included columns are large

Core rules:

Every index has a write cost — measure it, don't ignore it
Remove indexes the planner isn't using (idx_scan = 0)
Low-cardinality columns need partial indexes, not full indexes
HOT updates are killed by indexes on updated columns — audit before adding
Check correlation before relying on B-tree range scans
BRIN is underused — add it to sequential timestamp/ID columns
After bulk loads, always ANALYZE — stale stats cause good indexes to be ignored

Indexes determine how the planner accesses data. Module 6 covers how the planner actually makes those decisions — the cost model, statistics, join strategies, and how to diagnose every class of plan regression.

Next: Module 6 — Query Planning and Execution: How Postgres Decides What to Do With Your Query →

PreviousModule 4: Autovacuum: The Process Everyone Misconfigures Next Module 6: Query Planning and Execution: How Postgres Decides What to Do With Your Query