Module P-2·21 min read

The B-tree index for practitioners, EXPLAIN basics, composite and partial indexes, and the write cost trade-off.

P-2 — Indexes: When and How to Add Them

Who this module is for: You have built schemas and written queries. Now a query that was fast with 100 rows is slow with 100,000. This module explains what indexes actually do (without diving into the B-tree internals — that is Phase 3), how to tell if you need one, how to create and use them correctly, and the cost that most tutorials never mention: every index makes writes slower.

What an Index Actually Does

When you run SELECT * FROM products WHERE price = 49.99, PostgreSQL has two options:

Sequential scan — read every single row from the table and check whether price = 49.99. If you have 1,000,000 rows, it reads all 1,000,000.

Index scan — jump directly to the rows where price = 49.99 using a pre-built lookup structure. If 3 rows match, it reads roughly 3 rows plus the index overhead.

An index is a separate data structure maintained by PostgreSQL that maps column values to the physical locations of rows. Think of it like a book's index: instead of reading every page to find "PostgreSQL", you look it up in the index and go directly to the pages listed.

The core tradeoff: an index makes SELECT faster but makes INSERT, UPDATE, and DELETE slower — because every write must update both the table and the index. Adding an index to every column is not a good strategy.

Reading `EXPLAIN` — Your Diagnostic Tool

Before adding an index, measure. EXPLAIN shows the query execution plan PostgreSQL would use.

sql
-- Show the plan without executing the query
EXPLAIN SELECT * FROM tasks WHERE status = 'todo';

-- Show the plan AND execute, with actual timing
EXPLAIN ANALYZE SELECT * FROM tasks WHERE status = 'todo';

-- Full diagnostic output
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM tasks WHERE status = 'todo';

Reading the output:

Seq Scan on tasks  (cost=0.00..35.50 rows=234 width=128)
                             (actual time=0.012..0.487 rows=234 loops=1)
  Filter: (status = 'todo')
  Rows Removed by Filter: 316
Planning Time: 0.142 ms
Execution Time: 0.532 ms

Field	Meaning
`Seq Scan`	Sequential scan — reading every row
`cost=0.00..35.50`	Estimated cost (startup..total, arbitrary units)
`rows=234`	Estimated row count
`actual time=0.012..0.487`	Real time in milliseconds (start..end)
`Rows Removed by Filter: 316`	How many rows were read but discarded
`Planning Time`	Time to generate the plan
`Execution Time`	Total actual execution time

After adding an index:

Index Scan using idx_tasks_status on tasks
  (cost=0.29..8.31 rows=234 width=128)
  (actual time=0.018..0.089 rows=234 loops=1)
  Index Cond: (status = 'todo')
Planning Time: 0.234 ms
Execution Time: 0.112 ms

The Seq Scan became an Index Scan. Execution time dropped from 0.532ms to 0.112ms — about 5× faster on a small table. On a million-row table, the difference would be far more dramatic.

Creating Indexes

Basic Index

sql
-- Create an index on a single column
CREATE INDEX idx_tasks_status ON tasks (status);

-- Index name convention: idx_{table}_{column(s)}
-- Descriptive names help you understand what each index is for

`CREATE INDEX CONCURRENTLY` — No Table Lock

Regular CREATE INDEX locks the table from writes for the duration of the build. On a large production table, this can take minutes and block your application.

sql
-- Build the index without locking the table (safe for production)
CREATE INDEX CONCURRENTLY idx_tasks_status ON tasks (status);
-- Takes longer to build, but reads and writes continue normally
-- Use this for any table with live traffic

Unique Index

sql
-- Enforces uniqueness AND provides fast lookups
CREATE UNIQUE INDEX idx_users_email ON users (email);
-- Equivalent to: UNIQUE constraint (which creates an index automatically)

Dropping an Index

sql
DROP INDEX idx_tasks_status;
DROP INDEX CONCURRENTLY idx_tasks_status;  -- without locking writes

Composite Indexes — Column Order Matters

A composite index on (a, b) can be used for queries filtering on a alone, or a AND b together. It cannot efficiently answer queries on b alone.

sql
CREATE INDEX idx_tasks_project_status ON tasks (project_id, status);

-- ✅ Uses the index (leading column: project_id)
SELECT * FROM tasks WHERE project_id = 1;

-- ✅ Uses the index (both columns)
SELECT * FROM tasks WHERE project_id = 1 AND status = 'todo';

-- ❌ Does NOT use the index efficiently (missing leading column)
SELECT * FROM tasks WHERE status = 'todo';
-- PostgreSQL will do a sequential scan or a separate index if one exists

Rule: put the equality filter column first, the range filter column second.

sql
-- For a query like: WHERE status = 'todo' AND price > 50
-- Correct composite index:
CREATE INDEX idx_products_status_price ON products (status, price);
-- status = 'todo' narrows to a small set, then price > 50 filters within it

Partial Indexes — Index Only the Rows You Query

A partial index only includes rows matching a WHERE condition. It is smaller, faster to build, and uses less memory than a full index.

sql
-- Only index open tasks (the ones you query most often)
CREATE INDEX idx_tasks_open ON tasks (project_id, created_at)
WHERE status IN ('todo', 'in_progress');
-- Completed and cancelled tasks are excluded — they are rarely queried

-- Only index active users
CREATE INDEX idx_users_active_email ON users (email)
WHERE deleted_at IS NULL;

For a table where 95% of rows are in terminal states (done, archived), a partial index on the active 5% is dramatically smaller and faster.

Expression Indexes — Index a Computed Value

sql
-- Case-insensitive email lookups
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- The query must use the same expression to use this index:
SELECT * FROM users WHERE LOWER(email) = LOWER('Alice@Example.com');
-- ✅ Uses idx_users_email_lower

SELECT * FROM users WHERE email = 'alice@example.com';
-- ❌ Does NOT use it (different expression)

Foreign Key Indexes — The Most Forgotten Optimization

PostgreSQL does not automatically create indexes on foreign key columns. This surprises many engineers. When you delete a row from the parent table, PostgreSQL must scan the child table to check for references — without an index, this is a full sequential scan.

sql
-- Foreign keys in tasks
ALTER TABLE tasks ADD CONSTRAINT tasks_project_fk
  FOREIGN KEY (project_id) REFERENCES projects(id);

ALTER TABLE tasks ADD CONSTRAINT tasks_assigned_fk
  FOREIGN KEY (assigned_to) REFERENCES users(id);

-- These indexes are NOT created automatically — add them manually:
CREATE INDEX idx_tasks_project_id  ON tasks (project_id);
CREATE INDEX idx_tasks_assigned_to ON tasks (assigned_to);
CREATE INDEX idx_comments_task_id  ON comments (task_id);
CREATE INDEX idx_comments_author   ON comments (author_id);

Rule: every foreign key column should have an index. Without it, joins and deletions that reference that column are slow.

The Write Cost of Indexes

Every index you add slows down every INSERT, UPDATE, and DELETE on that table.

A table with no indexes:
  INSERT: write 1 place (the table)
  UPDATE: write 1 place
  DELETE: write 1 place

The same table with 5 indexes:
  INSERT: write 6 places (table + 5 indexes)
  UPDATE: write up to 6 places (table + any index whose column changed)
  DELETE: write 6 places

For a table that has 100,000 reads per second but only 10 writes per second, many indexes are fine. For a table that has 50,000 writes per second (like an event log or metrics table), every index is expensive.

This is why you should not add an index "just in case." Measure first, add indexes for proven slow queries.

Common Index Mistakes

Mistake 1: Indexing a low-cardinality column

sql
-- A 'status' column with 4 possible values has low cardinality
-- An index on it is often useless when many rows match
CREATE INDEX idx_orders_status ON orders (status);

-- If 'confirmed' represents 80% of rows, PostgreSQL ignores the index
-- and does a sequential scan — it is cheaper to scan than to use the index
-- when the result set is large

-- FIX: use a partial index for the selective values only
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';
-- Indexes only pending orders (maybe 2%) — highly selective

Mistake 2: Over-indexing a write-heavy table

sql
-- A logging table receiving 10,000 inserts/second
-- Every index you add costs you on every insert

-- BAD: indexing everything
CREATE INDEX idx_events_user_id    ON events (user_id);
CREATE INDEX idx_events_type       ON events (event_type);
CREATE INDEX idx_events_created_at ON events (created_at);
CREATE INDEX idx_events_session_id ON events (session_id);
-- 4 extra writes per INSERT, 10,000 inserts/sec = 40,000 extra write ops/sec

Mistake 3: Indexing a column used in a function

sql
-- The index is NOT used because the function wraps the column
CREATE INDEX idx_users_email ON users (email);
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- ❌ Doesn't use the index (LOWER() prevents it)

-- FIX: create an expression index
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- ✅ Now uses the expression index

Mistake 4: Not running `ANALYZE` after bulk loads

sql
-- After inserting millions of rows, PostgreSQL's statistics are stale
-- The planner may choose sequential scan even when an index exists

INSERT INTO events SELECT ... FROM staging;  -- insert 5 million rows

-- Run ANALYZE to update statistics
ANALYZE events;
-- Now the planner knows the table is large and uses indexes appropriately

Monitoring Index Usage

sql
-- Are your indexes actually being used?
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan AS times_used,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Low idx_scan = index may not be needed
-- Consider dropping unused indexes (they still slow down writes)

-- Table statistics: sequential scans vs index scans
SELECT
  relname AS table,
  seq_scan,
  idx_scan,
  n_live_tup AS row_count
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
-- High seq_scan on large tables = potentially missing indexes

Practical Exercise: Indexing the Task Manager

Starting with the task manager schema from F-7 with 50,000 rows seeded:

sql
-- Run these EXPLAIN queries and note the execution plans

-- Q1: Are we doing sequential scans?
EXPLAIN ANALYZE
SELECT * FROM tasks WHERE project_id = 42 AND status = 'todo';
-- Likely: Seq Scan (no index yet)

-- Add the right index
CREATE INDEX CONCURRENTLY idx_tasks_project_status
ON tasks (project_id, status);

-- Run EXPLAIN again — should now show Index Scan
EXPLAIN ANALYZE
SELECT * FROM tasks WHERE project_id = 42 AND status = 'todo';

-- Q2: Add indexes for common query patterns
-- Tasks assigned to a user, ordered by due date
CREATE INDEX CONCURRENTLY idx_tasks_assigned_due
ON tasks (assigned_to, due_date)
WHERE status NOT IN ('done', 'cancelled');

-- Q3: Comments by task (very common query)
CREATE INDEX CONCURRENTLY idx_comments_task_created
ON comments (task_id, created_at);

-- Q4: User lookup by email (login flow)
CREATE INDEX CONCURRENTLY idx_users_email_lower
ON users (LOWER(email));

-- Q5: Check that foreign keys are indexed
CREATE INDEX CONCURRENTLY idx_tasks_project_id  ON tasks (project_id);
CREATE INDEX CONCURRENTLY idx_tasks_assigned_to ON tasks (assigned_to);
CREATE INDEX CONCURRENTLY idx_comments_task_id  ON comments (task_id);
CREATE INDEX CONCURRENTLY idx_comments_author   ON comments (author_id);
CREATE INDEX CONCURRENTLY idx_project_members_user
ON project_members (user_id);

-- Q6: Verify your indexes
\d tasks
-- Should show all the indexes you created

Summary

Concept	Key Takeaway
Sequential scan	Reads every row — fine for small tables or when most rows match
Index scan	Jumps directly to matching rows — fast for selective queries
`EXPLAIN ANALYZE`	Always measure before and after adding an index
`CREATE INDEX CONCURRENTLY`	Build index without blocking writes — use in production
Composite index	Column order matters: equality columns first, range columns last
Partial index	Only indexes rows matching a condition — smaller and faster
Expression index	Index a computed value (`LOWER(email)`)
Foreign key indexes	Not automatic — add manually for every FK column
Write cost	Every index slows down INSERT/UPDATE/DELETE
Low cardinality	Low-cardinality columns (status with 4 values) rarely benefit from full indexes
Index monitoring	`pg_stat_user_indexes.idx_scan = 0` means the index is never used — drop it

Module P-3 covers transactions and ACID in practice — what BEGIN, COMMIT, ROLLBACK actually do, isolation levels, and writing safe atomic operations for scenarios like bank transfers and inventory deduction.

Next: P-3 — Transactions and ACID in Practice →

PreviousModule P-1: Advanced SQL — The Patterns You Will Use Every Week Next Module P-3: Transactions and ACID in Practice