An index is not a free performance boost. Every index has a write cost, a bloat trajectory, and a planner interaction that can go wrong.
Module 5 — Indexes: B-Tree Internals, GIN, GiST, and When Each One Hurts You
What this module covers: An index is not a free performance boost. Every index has a write cost that compounds with every INSERT, UPDATE, and DELETE. Every index bloats over time. Every index has a specific structure that makes it fast for some query patterns and useless — or actively harmful — for others. This module covers the internal mechanics of every major Postgres index type, how the planner decides whether to use them, and the discipline of indexing for write-heavy production systems.
The Fundamental Trade-off
Before any index mechanics: the decision to add an index is always a trade-off between read speed and write cost.
Every index on a table is an additional data structure that must be kept consistent with the heap. When you insert a row, Postgres inserts into the heap and into every index. When you update a row, Postgres updates every index whose columns changed. When you delete a row, Postgres marks the row dead in the heap and in every index.
On a table with 8 indexes:
- An
INSERTwrites to 9 locations (1 heap + 8 indexes) - An
UPDATEthat changes 3 indexed columns writes to 7 locations (1 heap old + 1 heap new + 5 unchanged index entries + 2 old index deletes + 2 new index inserts — and WAL for all of it) - A
DELETEmarks dead in 9 locations
This is before considering the WAL generated for each operation. With full page writes after each checkpoint (Module 3), each of those 9 page modifications can generate 8KB of WAL on first write post-checkpoint.
The discipline of indexing: add indexes for queries that are hot enough to justify the write cost on every insert/update/delete. Remove indexes that are not being used. Be precise about which columns need indexing and in which order.
B-Tree Index Internals
The B-tree is Postgres's default index type. CREATE INDEX without specifying a type creates a B-tree. Understanding it at the page level makes every other B-tree behavior — splits, bloat, planner decisions, ordering — obvious.
The B-Tree Structure
A B-tree index is a balanced tree of fixed-size 8KB pages. There are three kinds of pages:
Meta page (page 0): Contains the root page pointer and fast-root pointer. The root is where all searches start.
Internal pages (branch nodes): Contain key-pointer pairs. Each entry holds an index key value and a pointer to the child page where values ≤ that key live. Internal pages do not contain heap TIDs — they are navigation only.
Leaf pages: Contain index entries, each being a key value + heap TID (page number + offset). Leaf pages are linked in a doubly-linked list in sorted order — this is what makes range scans efficient. You find the first matching leaf entry, then follow the right-sibling pointer without traversing the tree again.
sql-- Inspect a B-tree's structure using pageinspect CREATE EXTENSION pageinspect; -- See the meta page SELECT * FROM bt_metap('idx_transactions_block_height'); -- root | level | fastroot | ... -- 412 | 3 | 412 | ... ← 3 levels deep for a large index -- See internal page contents SELECT itemoffset, ctid, itemlen, nulls, vars, data FROM bt_page_items('idx_transactions_block_height', 412) LIMIT 10;
B-Tree Searches
A point lookup (e.g., WHERE block_height = 18500050) traverses from root to leaf:
- Read root page → find the internal entry where
18500050falls → get child pointer - Read internal page → find the next child pointer
- Read leaf page → find entries with
block_height = 18500050, extract TIDs - For each TID: heap fetch (check visibility, return row)
For a 3-level tree with 100M rows, this is 3 page reads + 1 heap read = 4 random I/Os per row. With shared_buffers warmed up, those pages are in cache and the lookup is pure CPU.
Range Scans and the Leaf Chain
For WHERE block_height BETWEEN 18500000 AND 18500100:
- Tree traversal to find the first leaf entry ≥ 18500000
- Scan right along the leaf chain, collecting TIDs
- Sort TIDs (or not, depending on the access pattern)
- Fetch each heap page
The leaf chain scan is sequential — efficient. The heap fetches are random — potentially expensive if the matching rows are spread across many heap pages. This is where correlation matters (covered below).
Page Splits: The Source of B-Tree Bloat
When a leaf page fills up, Postgres splits it: the existing entries are divided between the current page and a new page, and a pointer to the new page is inserted into the parent. If the parent is also full, it splits too — cascading upward.
The right-growth optimization: When inserts are strictly sequential (e.g., a BIGSERIAL primary key), Postgres detects this and always splits to the right — new pages are appended rather than split. This produces densely-packed pages with no wasted space. Sequential primary keys are cheaper to maintain in indexes than random ones.
Random inserts cause 50% fill factor after splits. By default, Postgres splits a full page roughly in half. After the split, both pages are ~50% full. On a random-insert workload, pages frequently hover between 50% and 100% full — wasting up to 50% of index storage.
Setting a lower fillfactor leaves room for updates without splits:
sql-- Create index with 70% fill factor — 30% reserved for in-page updates CREATE INDEX idx_transactions_status ON transactions (status) WITH (fillfactor = 70);
For indexes on frequently-updated columns, a lower fillfactor reduces page splits at the cost of larger index size. For append-only tables, use the default (90) or even higher.
HOT Updates and Index Overhead
Heap Only Tuple (HOT) updates are a critical optimization. When you update a row and the updated columns are NOT indexed, Postgres can avoid updating any indexes:
- Old tuple is marked dead in the heap
- New tuple is written to the same heap page (if space allows)
- The old tuple's
ctidpointer chains to the new tuple - No index entries are modified
HOT updates generate significantly less WAL and avoid index bloat. They only work when:
- The updated columns are not part of any index
- The new tuple fits on the same heap page as the old tuple
sql-- Check HOT update rate for a table SELECT relname, n_tup_upd, n_tup_hot_upd, ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct FROM pg_stat_user_tables WHERE relname = 'transactions'; -- Low hot_pct = most updates are touching indexed columns -- High hot_pct = good — HOT is working, index maintenance is minimal
If hot_pct is low on a high-update table, audit which columns are being updated versus which are indexed. Sometimes removing an index on a column that is updated frequently (but not queried with it) dramatically improves write throughput.
Index Correlation
Correlation measures how well the physical order of rows in the heap matches the logical order of values in an index. It ranges from -1 (perfectly reversed) to +1 (perfectly aligned).
sql-- Check correlation for indexed columns SELECT attname, correlation FROM pg_stats WHERE tablename = 'transactions' AND attname IN ('id', 'block_height', 'timestamp', 'sender'); -- Example output: -- id | 0.9998 ← nearly perfect sequential, index scans are cheap -- block_height | 0.8342 ← good correlation, range scans efficient -- timestamp | -0.0023 ← random order, range scans cause many heap fetches -- sender | 0.0011 ← random, each index hit = separate heap page read
When correlation is low, an index scan fetches rows from scattered heap pages — potentially one random I/O per row. For large result sets, the planner may choose a sequential scan instead, because reading heap pages sequentially is more efficient than random fetches.
This is why CLUSTER (which rewrites the table in index order) can dramatically speed up range queries on low-correlation columns — but it's a full table rewrite and acquires an exclusive lock.
Multi-Column Indexes
A multi-column (composite) index on (a, b) stores entries sorted by a first, then b within each a value.
sqlCREATE INDEX idx_transactions_height_status ON transactions (block_height, status);
This index efficiently supports:
WHERE block_height = X— uses the index (leftmost column)WHERE block_height = X AND status = Y— uses both columnsWHERE block_height BETWEEN X AND Y— range on leading columnORDER BY block_height, status— sorted output
It does not efficiently support:
WHERE status = Y— cannot use the index without the leading columnORDER BY status— not sorted by trailing column alone
The leading column rule: The index is only usable when the query predicates include the leftmost column(s). A query on status alone cannot use idx_transactions_height_status.
Column order matters for range queries: Put equality predicates first, range predicates last.
sql-- Query: WHERE block_height BETWEEN X AND Y AND status = 'confirmed' -- Best index order: (status, block_height) — equality first, range second CREATE INDEX idx_transactions_status_height ON transactions (status, block_height); -- With (block_height, status): scans entire height range, filters status -- With (status, block_height): scans only 'confirmed' rows in height range
Partial Indexes
A partial index indexes only the rows that match a WHERE condition. This is one of the most underused features in Postgres.
sql-- Instead of indexing all 10M transactions, index only pending ones -- If 95% are 'confirmed', this is ~50x smaller than a full index CREATE INDEX idx_transactions_pending ON transactions (created_at) WHERE status = 'pending'; -- The planner uses this index when the query includes the predicate SELECT * FROM transactions WHERE status = 'pending' AND created_at < now() - INTERVAL '10 minutes'; -- → uses idx_transactions_pending
Partial indexes:
- Are smaller (fewer entries → fewer pages → more cache hits)
- Have lower write cost (only insert/delete when the condition is met)
- Can be more selective (same selectivity over a smaller row set)
The pattern for status-filtered tables: If 95% of rows are in a terminal state (confirmed, failed) and 5% are active (pending), a partial index on active rows is 20x smaller than a full index and much more cache-friendly.
sql-- Unique constraint only on active rows (allow multiple 'completed' per user) CREATE UNIQUE INDEX idx_active_session_per_user ON sessions (user_id) WHERE status = 'active';
Expression Indexes
An expression index indexes the result of a function or expression, not a raw column value.
sql-- Case-insensitive email lookups CREATE INDEX idx_users_email_lower ON users (lower(email)); -- The planner uses this when the query uses the same expression SELECT * FROM users WHERE lower(email) = lower('User@Example.com'); -- Without the expression index, this query does a full scan even with -- an index on email (because lower(email) ≠ email for the optimizer)
sql-- Index on a JSONB field without a GIN index CREATE INDEX idx_transactions_gas_used ON transactions ((payload->>'gas_used')::bigint); SELECT * FROM transactions WHERE (payload->>'gas_used')::bigint > 100000;
Expression indexes increase write cost because Postgres must evaluate the expression for every inserted/updated row and store the result in the index.
GIN Indexes: For Containment Queries
Generalized Inverted Index (GIN) is designed for data types where a single column value contains multiple "elements" — arrays, JSONB documents, tsvectors for full-text search.
How GIN Works
A GIN index maintains a posting list structure:
- Key: each distinct element (array element, JSONB key, lexeme)
- Value: the list of heap TIDs that contain this key
For a JSONB column, GIN indexes every key and value in every document. For an array column, it indexes every array element.
sql-- GIN on JSONB for containment queries CREATE INDEX idx_transactions_payload_gin ON transactions USING GIN (payload); -- Now these queries use the index: SELECT * FROM transactions WHERE payload @> '{"type": "Transfer"}'; SELECT * FROM transactions WHERE payload ? 'gas_used'; SELECT * FROM transactions WHERE payload @> '{"events": [{"type": "Transfer"}]}';
GIN Operators
| Operator | Meaning | Example |
|---|---|---|
@> | Contains | payload @> '{"type": "Transfer"}' |
<@ | Contained by | '{"a":1}' <@ payload |
? | Key exists | payload ? 'gas_used' |
| `? | ` | Any key exists |
?& | All keys exist | payload ?& ARRAY['a','b'] |
@@ | Full-text match | to_tsvector(body) @@ plainto_tsquery('search') |
GIN Write Cost: The Pending List
GIN indexes have a significantly higher write cost than B-tree because inserting one row requires updating the posting list for every element in the document. A JSONB document with 20 keys generates 20+ index updates per INSERT.
To mitigate this, GIN uses a pending list: new entries are first written to a small pending list on disk, and then merged into the main GIN structure in bulk during VACUUM or when the pending list fills gin_pending_list_limit (default 4MB).
inigin_pending_list_limit = 4MB # how large the pending list can grow before forced merge
This pending list is great for write throughput but means GIN indexes may not reflect very recent changes until the pending list is merged. For most use cases this is fine.
sql-- Force pending list merge (useful after bulk loads) SELECT gin_clean_pending_list('idx_transactions_payload_gin');
jsonb_path_ops vs jsonb_ops
sql-- Default: indexes all keys AND values (larger, more flexible) CREATE INDEX ON transactions USING GIN (payload); -- jsonb_ops -- Optimized: indexes only values for containment (@>) queries (smaller, faster) CREATE INDEX ON transactions USING GIN (payload jsonb_path_ops); -- jsonb_path_ops does NOT support ? (key exists) queries -- Use jsonb_ops if you need both containment and key-exists -- Use jsonb_path_ops if you only need @>
jsonb_path_ops produces a smaller index (indexes only paths, not individual keys) and is faster for @> queries. The trade-off: it does not support ?, ?|, ?& operators.
GiST Indexes: For Geometric and Range Queries
Generalized Search Tree (GiST) is an extensible index framework that supports spatial queries, range type overlap, and nearest-neighbor searches. Unlike B-tree (exact ordering) and GIN (containment), GiST supports geometric predicates.
sql-- Range type overlap CREATE TABLE bookings ( id BIGSERIAL, room_id INT, reserved_during TSTZRANGE ); CREATE INDEX idx_bookings_range ON bookings USING GiST (reserved_during); -- Find overlapping bookings SELECT * FROM bookings WHERE reserved_during && '[2026-05-17, 2026-05-20)'; -- Exclusion constraint using GiST (prevents overlapping bookings for same room) ALTER TABLE bookings ADD CONSTRAINT no_overlap EXCLUDE USING GiST (room_id WITH =, reserved_during WITH &&);
GiST for Full-Text Search
GiST can also index tsvector (full-text search vectors), but GIN is almost always preferred for full-text — GIN is faster for lookup, GiST is faster for writes.
sql-- GiST for full-text (use when write speed matters more than read speed) CREATE INDEX ON documents USING GiST (to_tsvector('english', body)); -- GIN for full-text (use when read speed matters more) CREATE INDEX ON documents USING GIN (to_tsvector('english', body));
GiST for Nearest-Neighbor (KNN)
GiST supports KNN searches using the <-> distance operator:
sql-- PostGIS example: find nearest locations CREATE INDEX ON locations USING GiST (geom); SELECT name, geom <-> ST_MakePoint(-122.4194, 37.7749) AS distance FROM locations ORDER BY geom <-> ST_MakePoint(-122.4194, 37.7749) LIMIT 10; -- Uses GiST index for KNN — efficient without scanning all rows
BRIN Indexes: For Naturally Ordered Large Tables
Block Range INdex (BRIN) is fundamentally different from B-tree and GIN. Instead of indexing individual values, it stores the min and max value for each range of heap pages (a "block range", default 128 pages = 1MB).
sqlCREATE INDEX idx_transactions_timestamp_brin ON transactions USING BRIN (timestamp) WITH (pages_per_range = 128);
BRIN is useful when the column value is correlated with physical insertion order — i.e., newer rows have later timestamps, and they are inserted at the end of the heap. This is true for:
- Auto-incrementing IDs (
BIGSERIAL) - Timestamps with append-only insert patterns
- Block heights in a blockchain indexer
sql-- Query benefits from BRIN SELECT * FROM transactions WHERE timestamp > now() - INTERVAL '1 day'; -- BRIN eliminates all block ranges where max(timestamp) < threshold -- For a 10M row table, this might eliminate 90% of pages with a 1KB index
BRIN vs B-tree: When to Choose
| B-Tree | BRIN | |
|---|---|---|
| Size | Large (proportional to row count) | Tiny (proportional to page count) |
| Write cost | Per-row update | Per-block-range, lazy |
| Read speed | Fast for any selectivity | Fast only with high correlation |
| Random data | Efficient | Useless |
| Sequential data | Works, but BRIN is cheaper | Ideal |
The rule: use BRIN on timestamp or sequential ID columns in append-only tables when cardinality is high and correlation is high. Use B-tree everywhere else that needs fast point lookup.
A BRIN index on a 100M-row table with 128 pages per range is roughly 100M / 128 / (8KB/entry) ≈ 100KB. A B-tree on the same column is hundreds of megabytes. The BRIN query is slower (must scan all pages in matching ranges), but for large time-range filters it eliminates the vast majority of pages.
Hash Indexes
Hash indexes store a hash of the indexed value and are only useful for equality comparisons (=). They are smaller than B-tree for equality workloads but do not support range queries, ordering, or NULL.
sqlCREATE INDEX idx_transactions_hash_hash ON transactions USING HASH (hash); -- Only useful for: WHERE hash = '\x...' -- Useless for: WHERE hash LIKE '\x12%', ORDER BY hash, range queries
In practice, B-tree handles equality lookups nearly as fast as hash indexes, and B-tree is more versatile. Hash indexes are a niche optimization for very high cardinality equality columns (like UUIDs or hashes) where the planner otherwise might not use the index effectively.
Index-Only Scans
When all columns needed by a query are in the index, Postgres can satisfy the query without touching the heap at all — an index-only scan.
sql-- Index on (block_height, status, timestamp) CREATE INDEX idx_transactions_composite ON transactions (block_height, status, timestamp); -- This query needs only those three columns: SELECT block_height, status, timestamp FROM transactions WHERE block_height BETWEEN 18500000 AND 18500100; -- → Index Only Scan (no heap fetch)
Index-only scans are fast because they avoid random heap I/O. But they require the Visibility Map to confirm tuple visibility — if the VM bit for a page is not set (page has not been vacuumed recently), Postgres must fetch the heap page anyway to check visibility.
sql-- See how many heap fetches an index-only scan is actually making EXPLAIN (ANALYZE, BUFFERS) SELECT block_height, status, timestamp FROM transactions WHERE block_height BETWEEN 18500000 AND 18500100; -- Look for: -- Index Only Scan using idx_transactions_composite -- Heap Fetches: 0 ← perfect — VM bits are set, no heap reads -- Heap Fetches: 847 ← VM bits not set, autovacuum hasn't run recently
If heap fetches are high on an index-only scan, run VACUUM on the table to set VM bits. After vacuum, heap fetches should drop to zero.
Covering Indexes: INCLUDE
Postgres 11+ supports INCLUDE columns — columns stored in the index leaf pages but not part of the sort key. This enables index-only scans without affecting the index sort order.
sql-- Index sorted by (block_height, status), with timestamp stored but not sorted CREATE INDEX idx_transactions_height_covering ON transactions (block_height, status) INCLUDE (timestamp, sender); -- Query: needs block_height, status, timestamp, sender SELECT block_height, status, timestamp, sender FROM transactions WHERE block_height = 18500050 AND status = 'confirmed'; -- → Index Only Scan — all four columns available in index leaf pages
INCLUDE columns increase index size but don't affect the B-tree structure or split behavior. Use them when you have a frequently-used query that needs just a few extra columns beyond the index key.
When Indexes Hurt: The Cases Most Engineers Miss
1. Low-Selectivity Columns
An index on a column with few distinct values (low cardinality) is often useless or counterproductive.
sql-- status has 3 values: 'confirmed' (95%), 'pending' (4%), 'failed' (1%) CREATE INDEX idx_transactions_status ON transactions (status); -- Query for 'confirmed' rows: planner won't use this index SELECT * FROM transactions WHERE status = 'confirmed'; -- 95% of rows match — sequential scan is faster than 9.5M random heap fetches
The planner estimates that fetching 95% of rows via index is more expensive than a sequential scan. It will choose the sequential scan. The index is wasted disk space and write overhead.
The fix: partial index on the selective values only:
sqlCREATE INDEX idx_transactions_pending ON transactions (created_at) WHERE status = 'pending'; CREATE INDEX idx_transactions_failed ON transactions (created_at) WHERE status = 'failed';
2. Too Many Indexes on a Write-Heavy Table
sql-- A table with 12 indexes on a blockchain transactions table \d transactions -- Indexes: -- idx_transactions_pkey (id) -- idx_transactions_hash (hash) -- idx_transactions_block_height (block_height) -- idx_transactions_sender (sender) -- idx_transactions_recipient (recipient) -- idx_transactions_status (status) -- idx_transactions_timestamp (timestamp) -- idx_transactions_created_at (created_at) -- idx_transactions_payload_gin (payload) GIN -- idx_transactions_sender_height (sender, block_height) -- idx_transactions_height_status (block_height, status) -- idx_transactions_covering (block_height, status) INCLUDE (timestamp)
At 2,000 inserts/second, each insert touches all 12 index structures. That's 24,000 index page modifications per second, each potentially generating WAL. This is the hidden write amplification that only becomes visible when write throughput starts saturating I/O.
Audit unused indexes:
sql-- Find indexes that have never been used since last stats reset SELECT schemaname, relname AS table, indexrelname AS index, pg_size_pretty(pg_relation_size(indexrelid)) AS size, idx_scan AS scans_since_reset FROM pg_stat_user_indexes JOIN pg_index USING (indexrelid) WHERE idx_scan = 0 AND NOT indisprimary AND NOT indisunique ORDER BY pg_relation_size(indexrelid) DESC;
Drop indexes with idx_scan = 0 that have been around long enough to capture representative traffic (reset stats after major traffic pattern changes with SELECT pg_stat_reset()).
3. Indexes That Are Never Used by the Planner
An index exists but the planner refuses to use it — not because it's wrong, but because the cost estimate says sequential scan is cheaper.
sqlEXPLAIN SELECT * FROM transactions WHERE block_height > 18000000; -- Seq Scan on transactions (cost=0.00..285432.00 rows=9500000 ...) -- Filter: (block_height > 18000000)
If block_height > 18000000 matches 95% of rows, the planner is correct — sequential scan is faster. The index exists but is never used. It costs write overhead on every insert with no read benefit.
Force the planner to show you what it thinks:
sql-- Check estimated selectivity SELECT n_distinct, correlation, most_common_vals, most_common_freqs, histogram_bounds FROM pg_stats WHERE tablename = 'transactions' AND attname = 'block_height';
4. Indexes That Break HOT Updates
Any index on a column that is frequently updated prevents HOT updates on those updates, forcing full index maintenance.
sql-- If 'last_seen_at' is updated every time a user logs in: CREATE INDEX idx_users_last_seen ON users (last_seen_at); -- Every login now does a full index delete + insert instead of HOT update -- Remove this index if 'last_seen_at' is not used in WHERE/ORDER BY queries
5. Unused Unique Indexes Beyond the Constraint
Unique indexes are often added for constraint enforcement, which is valid. But sometimes engineers add unique indexes "just in case" without realizing the constraint enforcement itself already creates an index:
sql-- This creates an implicit unique index: ALTER TABLE transactions ADD CONSTRAINT transactions_hash_key UNIQUE (hash); -- Adding this is redundant — same index already exists: CREATE UNIQUE INDEX idx_transactions_hash ON transactions (hash); -- DUPLICATE
Check for duplicate indexes (same columns, same order) and drop the redundant ones.
The Planner and Indexes: When Good Indexes Are Ignored
Statistics Freshness
The planner decides whether to use an index based on statistics. If statistics are stale (bulk load without ANALYZE, or autovacuum hasn't run), the planner may misestimate selectivity and choose the wrong plan.
sql-- After any bulk load or major data change: ANALYZE transactions; -- Check when statistics were last updated: SELECT attname, n_distinct, correlation, most_common_vals FROM pg_stats WHERE tablename = 'transactions';
enable_indexscan and Plan Forcing
sql-- Force the planner to use (or avoid) index scans for debugging SET enable_indexscan = off; -- disable index scans SET enable_bitmapscan = off; -- disable bitmap index scans SET enable_seqscan = off; -- disable sequential scans (forces index use) -- Never use these in production — diagnostic only
Bitmap Index Scans
When a query matches many rows via an index, Postgres may use a bitmap index scan instead of a regular index scan:
- Scan the index, building a bitmap of matching heap page numbers (not individual TIDs)
- Sort the page numbers
- Read heap pages in sorted order (sequential-ish access pattern)
This is more efficient than random-order index scan when many rows match. It trades the precise TID ordering for coarser page-level ordering, which reduces random I/O.
sqlEXPLAIN SELECT * FROM transactions WHERE block_height BETWEEN 18500000 AND 18510000; -- Bitmap Heap Scan on transactions -- Recheck Cond: (block_height BETWEEN 18500000 AND 18510000) -- -> Bitmap Index Scan on idx_transactions_block_height -- Index Cond: (block_height BETWEEN 18500000 AND 18510000)
Multiple bitmap index scans can be ANDed or ORed together:
sqlEXPLAIN SELECT * FROM transactions WHERE block_height BETWEEN 18500000 AND 18510000 AND status = 'pending'; -- BitmapAnd -- Bitmap Index Scan on idx_transactions_block_height -- Bitmap Index Scan on idx_transactions_pending
Production Index Audit Workflow
sql-- 1. Find large unused indexes SELECT relname AS table, indexrelname AS index, pg_size_pretty(pg_relation_size(indexrelid)) AS size, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes JOIN pg_indexes ON indexrelname = indexname AND relname = tablename WHERE idx_scan < 100 AND pg_relation_size(indexrelid) > 10000000 ORDER BY pg_relation_size(indexrelid) DESC; -- 2. Find duplicate indexes (same column set) SELECT indrelid::regclass AS table, array_agg(indexrelid::regclass) AS indexes, array_agg(indkey) AS column_sets FROM pg_index GROUP BY indrelid, indkey HAVING count(*) > 1; -- 3. Check index bloat SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS size, idx_scan FROM pg_stat_user_indexes JOIN pg_index USING (indexrelid) ORDER BY pg_relation_size(indexrelid) DESC LIMIT 20; -- 4. Find missing indexes (tables with sequential scans on large tables) SELECT relname, seq_scan, seq_tup_read, idx_scan, n_live_tup, pg_size_pretty(pg_total_relation_size(relid)) AS size FROM pg_stat_user_tables WHERE seq_scan > idx_scan AND n_live_tup > 100000 ORDER BY seq_tup_read DESC LIMIT 20;
Production Incident: Index Bloat Causing 3x Query Regression
Context: A transactions table with a B-tree index on (sender, block_height) used for API queries like "show me all transactions from sender X in the last 10,000 blocks."
What happened:
The table processed roughly 500 updates/second to the status column. These updates were NOT HOT updates because an index on status existed from an earlier investigation. The index on (sender, block_height) was fine — but each non-HOT update to status caused a full index page access in the sender index as well, because the new tuple needed a new entry even though sender hadn't changed.
After 6 months, the (sender, block_height) index had grown to 8GB from its original 2GB. Average leaf page density had dropped to 45% from the constant churn of dead tuple entries.
sql-- Diagnosis SELECT * FROM pgstatindex('idx_transactions_sender_height'); -- leaf_pages: 1048576 -- avg_leaf_density: 44.8 -- leaf_fragmentation: 38.2
The query WHERE sender = '0xabc...' AND block_height > 18400000 now scanned 2x as many leaf pages as necessary, because each page was only 45% full. Query time went from 2ms to 6ms — imperceptible in isolation, but 3x slower and visible in p99 latency.
The fix:
sql-- Step 1: Remove the unused status index (cause of non-HOT updates) DROP INDEX idx_transactions_status; -- Step 2: Reindex without locking (Postgres 12+) REINDEX INDEX CONCURRENTLY idx_transactions_sender_height; -- Step 3: Verify HOT update rate improved SELECT n_tup_upd, n_tup_hot_upd, ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct FROM pg_stat_user_tables WHERE relname = 'transactions'; -- hot_pct went from 12% to 94% after dropping the status index
After the reindex, query time returned to 2ms. The REINDEX INDEX CONCURRENTLY took 40 minutes on the 8GB index but did not lock reads or writes during the rebuild.
Summary
| Index Type | Best For | Avoid When |
|---|---|---|
| B-tree | Equality, range, ordering, most queries | Very low cardinality columns |
| GIN | Array containment, JSONB @>, full-text | Write-heavy columns (high write cost) |
| GiST | Ranges, spatial data, KNN | Simple equality or range on scalars |
| BRIN | Sequential, append-only timestamp/ID columns | Low-correlation or random data |
| Hash | High-cardinality equality only | Anything needing range or ordering |
| Partial | Selective subsets of rows | When most rows match the condition |
| Expression | Function-based lookups | Frequently updated expressions |
| Covering (INCLUDE) | Index-only scan for extra columns | When included columns are large |
Core rules:
- Every index has a write cost — measure it, don't ignore it
- Remove indexes the planner isn't using (
idx_scan = 0) - Low-cardinality columns need partial indexes, not full indexes
- HOT updates are killed by indexes on updated columns — audit before adding
- Check correlation before relying on B-tree range scans
- BRIN is underused — add it to sequential timestamp/ID columns
- After bulk loads, always
ANALYZE— stale stats cause good indexes to be ignored
Indexes determine how the planner accesses data. Module 6 covers how the planner actually makes those decisions — the cost model, statistics, join strategies, and how to diagnose every class of plan regression.
Next: Module 6 — Query Planning and Execution: How Postgres Decides What to Do With Your Query →