Module 0·26 min read

Why most engineers have a shallow model of Postgres — and what it costs them in production.

Module 0 — Before You Proceed: Mental Model Reset

Who this is for: Mid-to-senior backend engineers who have used Postgres in production but have never looked under the hood. You know how to write queries. You know how to add indexes. You may have even tuned work_mem. But when your query inexplicably regresses after a bulk insert, or your replica falls six hours behind on a Thursday morning, or VACUUM takes longer than your deployment window — you don't yet have the mental model to diagnose it from first principles.

That is what this module builds.

The Problem With How Most Engineers Learn Postgres

Most engineers learn Postgres through three channels:

The official documentation — comprehensive, accurate, and almost entirely organised around syntax rather than mechanics.
Tutorial sites — which teach you to SELECT, JOIN, and INDEX using tables named users and orders with 20 rows.
Stack Overflow — which tells you what to do but almost never why it works.

The result is an engineer who can write correct SQL but who treats the database as a black box. When that black box behaves unexpectedly under production load, they reach for EXPLAIN ANALYZE, see a sequential scan where they expected an index scan, and have no model for why the planner made that choice.

This course is the model they were never given.

What Postgres Actually Is

Postgres is not "just a reliable RDBMS." That framing is technically correct and practically useless.

Here is a more accurate description:

PostgreSQL is a multi-process, heap-based, MVCC-driven database engine with a cost-based query planner, a Write-Ahead Log for crash recovery and replication, and a background worker system for maintenance — all of which interact in ways that are not visible from the SQL interface.

Every word in that sentence is load-bearing. Let's unpack each one.

Multi-process architecture

When you start a Postgres instance, you start a supervisor process called the postmaster. The postmaster listens for connections and forks a new backend process for each one. Each backend is an independent OS process with its own memory — there is no thread pool, no shared execution context between connections.

This has direct consequences:

work_mem is allocated per sort operation per backend. If you set work_mem = 256MB and 100 clients run a query with two sort operations each, you have potentially allocated 51GB of RAM.
There is no shared query plan cache between connections in the way Oracle or SQL Server have one. Postgres caches plans per prepared statement per backend.
Connection overhead is real. Forking a process is expensive. This is why connection pooling (PgBouncer) is not optional above a few hundred connections.

The background workers running alongside backends include:

WAL writer — flushes Write-Ahead Log buffers to disk
Background writer — writes dirty shared_buffers pages to disk
Checkpointer — performs periodic checkpoints
Autovacuum workers — reclaim dead tuple space
WAL sender / WAL receiver — handle replication
Stats collector — populates pg_stat_* views

When your database feels "slow," it is almost always one of these background processes that is either overloaded or misconfigured.

Heap-based storage

Postgres stores table data in heap files — unordered collections of 8KB pages. There is no concept of a clustered index where the table data is physically sorted by a key (unlike InnoDB in MySQL). When you insert a row, it goes into the first available page with enough free space. Period.

This has consequences:

Sequential scans are efficient because pages are read in order from disk.
Random-access queries on non-indexed columns scan the entire heap.
Table data has no inherent ordering. SELECT * FROM orders LIMIT 10 does not return the 10 most recent orders. It returns 10 rows from whatever pages happen to be read first.
Physical correlation between data and indexes degrades over time as rows are inserted, updated, and deleted in non-sequential patterns.

You will encounter the heap file again in Module 1, where we look at the exact byte layout of a page and what happens to that layout under UPDATE workloads.

MVCC-driven

Multi-Version Concurrency Control is the mechanism by which Postgres allows concurrent readers and writers without one blocking the other. Instead of acquiring read locks, Postgres creates a new version of every modified row — the old version remains visible to transactions that started before the modification, while the new version is visible to transactions that started after.

The consequence that most engineers miss: dead tuples accumulate on disk. When you UPDATE a row, Postgres does not overwrite the old value. It marks the old tuple as dead and writes a new one. The old tuple sits in the heap, occupying space, until VACUUM reclaims it.

On a table with a high update rate, dead tuple accumulation is continuous. Without properly tuned autovacuum, tables grow without bound even if the number of live rows stays constant.

We will spend all of Module 2 on MVCC mechanics, including the snapshot model, transaction visibility, and the transaction ID wraparound problem that will shut your database down if you ignore it long enough.

Cost-based query planner

When you execute a query, Postgres does not simply find the "obvious" way to execute it. It runs a cost-based optimizer that considers multiple execution plans — different join orders, different index choices, parallel vs sequential — and estimates the cost of each using statistics it collects about your data distribution.

The keyword is estimates. The planner does not know the true cost of a plan. It uses statistics from the last ANALYZE run. If those statistics are stale (because you just loaded 50 million rows without running ANALYZE), the planner will make wrong choices based on outdated data.

This is why "the query got slower after the data load" is such a common production incident. Nothing changed about the query or the schema. The statistics changed — or failed to change — and the planner chose a different plan.

Module 6 covers the planner in depth, including how to read plan output, what statistics Postgres collects, and how to diagnose plan regressions.

How a Single SELECT Travels Through the System

Understanding the pipeline a query goes through demystifies most "why is this slow?" questions.

sql
SELECT
    t.hash,
    t.sender,
    t.amount
FROM transactions t
WHERE t.block_height BETWEEN 18500000 AND 18500100
  AND t.status = 'confirmed'
ORDER BY t.timestamp DESC
LIMIT 50;

Here is what happens when this hits the server:

1. Connection & authentication

The postmaster forks a backend process for your connection. SSL handshake, authentication, and session setup happen here. This is the cost you amortize by using a connection pool.

2. Parser

The query string is converted into a parse tree — a structured representation of the SQL. The parser checks syntax only. It does not know whether transactions exists or whether block_height is a valid column. That comes next.

3. Semantic analysis (Query Rewriter)

The parse tree is resolved against the system catalog (pg_catalog). Table and column names are validated and replaced with their internal OIDs. Views are expanded into their underlying queries. ON INSERT rules are applied. The output is a query tree.

4. Planner / Optimizer

The query tree enters the planner. The planner enumerates candidate execution plans:

Should it use the index on block_height? Only if the range is selective enough relative to the table size.
Should it use the index on status? If 'confirmed' represents 95% of rows, the planner will skip it and scan.
Should it sort the results using an explicit sort node, or can it use an index scan that already produces ordered output?
Is this query eligible for parallel execution?

The planner estimates the cost of each viable plan using statistics from pg_statistic and emits the lowest-cost plan.

5. Executor

The executor runs the plan. It pulls rows through a tree of plan nodes — index scans, sequential scans, hash joins, sort nodes, limit nodes. Each node pulls from its children. The results bubble up.

6. Buffer management

When the executor needs a page from a table or index, it first checks shared_buffers — the in-memory page cache. Cache hit: the page is returned immediately. Cache miss: the page is read from disk (or the OS page cache), placed in shared_buffers, and returned.

This is why shared_buffers sizing matters. A working set that fits in shared_buffers is fast. A working set that doesn't is I/O-bound.

Reading `EXPLAIN (ANALYZE, BUFFERS, WAL)` as a Diagnostic Instrument

EXPLAIN output is not a performance report. It is a plan description with runtime annotations. Here is how to read it correctly.

sql
EXPLAIN (ANALYZE, BUFFERS, WAL, FORMAT TEXT)
SELECT
    t.hash,
    t.sender,
    t.amount
FROM transactions t
WHERE t.block_height BETWEEN 18500000 AND 18500100
  AND t.status = 'confirmed'
ORDER BY t.timestamp DESC
LIMIT 50;

Hypothetical output:

Limit  (cost=0.57..1823.44 rows=50 width=76) (actual time=0.412..4.231 rows=50 loops=1)
  Buffers: shared hit=312 read=87
  ->  Index Scan Backward using idx_transactions_timestamp on transactions t
        (cost=0.57..9832.20 rows=269 width=76) (actual time=0.408..4.218 rows=50 loops=1)
        Index Cond: (block_height BETWEEN 18500000 AND 18500100)
        Filter: (status = 'confirmed')
        Rows Removed by Filter: 23
        Buffers: shared hit=312 read=87
Planning Time: 0.381 ms
Execution Time: 4.267 ms

What each field means:

cost=0.57..1823.44 — estimated cost in arbitrary units. The first number is startup cost (cost before the first row is returned). The second is total cost. These are estimates based on statistics. Do not treat them as milliseconds.

actual time=0.412..4.231 — real elapsed time in milliseconds. The first number is time to first row. The second is total time.

rows=50 width=76 — estimated rows (50) and average row width in bytes (76). Compare the estimated rows to actual rows. Large divergence = stale statistics.

Buffers: shared hit=312 read=87 — 312 page reads came from shared_buffers (free). 87 required disk reads. The ratio matters: a query doing mostly read on a table it accesses frequently is a sign that shared_buffers is undersized or the working set is too large.

Index Scan Backward — the planner used the index on timestamp in reverse order to satisfy the ORDER BY DESC LIMIT 50 without a sort node. This is the correct choice here — it avoids materializing all matching rows just to sort and limit them.

Filter: (status = 'confirmed') Rows Removed by Filter: 23 — the status filter was applied after the index scan, not through a separate index. 23 rows matched block_height but not status. This tells you the status selectivity is high enough that a separate index is not worth it for this query pattern.

WAL — when included, shows WAL bytes generated. Useful for write queries to understand write amplification. Not shown here because this is a SELECT.

The Four ACID Guarantees Through the Lens of Implementation

Atomicity — A transaction either commits all its changes or none. Implementation: uncommitted changes exist in shared_buffers and WAL. On commit, WAL is flushed to disk. On rollback or crash before commit, the changes are never written to the heap in a visible form. The WAL contains enough information to redo committed work and undo uncommitted work.

Consistency — Constraints (NOT NULL, UNIQUE, FOREIGN KEY, CHECK) are enforced at transaction boundaries. This is entirely in the executor and constraint-checking code paths. It is not a property of the storage engine itself.

Isolation — Transactions do not see each other's uncommitted work. Implementation: MVCC snapshots. Each transaction gets a snapshot of the committed state of the database at a specific point in time. Uncommitted work by other transactions is invisible regardless of timing.

Durability — Committed transactions survive crashes. Implementation: WAL. Before a commit returns to the client, the WAL record for that commit is flushed to disk (by default — synchronous_commit = off trades this guarantee for latency). On crash recovery, WAL is replayed to reconstruct the committed state.

The interaction between these four: MVCC enables I (Isolation) without read locks. WAL enables D (Durability) without synchronous heap writes. These two design choices — MVCC and WAL — are responsible for almost every behavior you will observe in production, both good and frustrating.

Setting Up Your Environment for This Course

The examples in this course require PostgreSQL 18 (for Modules 8 and 9) and PG 14+ for everything else. Install instructions for the environments we'll use:

Option 1: Local from source (recommended for Modules 8+)

bash
# Clone and build PG 18
git clone https://github.com/postgres/postgres.git
cd postgres
git checkout REL_18_STABLE
./configure --prefix=/usr/local/pgsql18 --with-openssl
make -j$(nproc)
make install

# Initialize and start
/usr/local/pgsql18/bin/initdb -D /tmp/pgdata18
/usr/local/pgsql18/bin/pg_ctl -D /tmp/pgdata18 start

Option 2: Docker

bash
docker run --name pg18-course \
  -e POSTGRES_PASSWORD=coursepass \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  postgres:18

Diagnostic extensions to install

sql
-- Page-level inspection
CREATE EXTENSION pageinspect;

-- Free space map visibility
CREATE EXTENSION pg_freespacemap;

-- Tuple-level bloat analysis
CREATE EXTENSION pgstattuple;

-- Verify your PG version
SELECT version();
-- PostgreSQL 18.x ...

The working dataset

Throughout this course we use a transactions table from a blockchain indexer — a real-world schema with challenging characteristics: append-heavy write patterns, high cardinality on hash columns, range queries on block heights, and JSONB columns for event data.

sql
CREATE TABLE transactions (
    id          BIGSERIAL PRIMARY KEY,
    hash        BYTEA        NOT NULL UNIQUE,
    block_height BIGINT      NOT NULL,
    sender      VARCHAR(66)  NOT NULL,
    recipient   VARCHAR(66),
    amount      NUMERIC(38, 8),
    status      VARCHAR(20)  NOT NULL DEFAULT 'pending',
    timestamp   TIMESTAMPTZ  NOT NULL,
    payload     JSONB,
    created_at  TIMESTAMPTZ  NOT NULL DEFAULT now()
);

-- Create 10 million rows for meaningful benchmarks
INSERT INTO transactions (
    hash, block_height, sender, recipient,
    amount, status, timestamp, payload
)
SELECT
    decode(md5(i::text || random()::text), 'hex'),
    18000000 + (i % 500000),
    '0x' || md5((i * 3)::text),
    '0x' || md5((i * 7)::text),
    (random() * 10000)::numeric(38,8),
    (ARRAY['confirmed', 'confirmed', 'confirmed', 'pending', 'failed'])[floor(random()*5+1)],
    now() - (random() * interval '180 days'),
    jsonb_build_object(
        'gas_used', (21000 + floor(random() * 100000))::int,
        'input_data', md5(i::text),
        'events', jsonb_build_array(
            jsonb_build_object('type', 'Transfer', 'amount', (random()*1000)::numeric(20,8))
        )
    )
FROM generate_series(1, 10000000) AS i;

ANALYZE transactions;

What This Course Will Not Do

Before module 1, a clear statement of scope:

This course will not teach you SQL syntax. If you need to learn GROUP BY or WINDOW functions, start elsewhere and come back.

This course will not give you a configuration template to copy. Every system has different workload characteristics. A template that works for an OLAP analytics database will actively harm an OLTP payment processor. You will learn to derive the right configuration from first principles.

This course will not cover every Postgres feature. Foreign Data Wrappers, logical decoding clients, PostGIS, and TimescaleDB are out of scope. The focus is the core engine that underpins all of them.

What this course will give you: A mental model detailed enough to diagnose any production incident from first principles, and the vocabulary to reason about trade-offs before they become emergencies.

Summary

Concept	Key Takeaway
Multi-process	One OS process per connection. `work_mem` multiplies per connection. Connection pooling is mandatory at scale.
Heap-based storage	Table data is unordered. No clustered indexes. Physical layout degrades over time.
MVCC	Dead tuples accumulate on every update. Autovacuum reclaims them. Misconfigured autovacuum = table bloat.
Cost-based planner	Plans are estimates based on statistics. Stale statistics = wrong plans. `ANALYZE` after bulk loads.
WAL	Every write is WAL-first. WAL is the source of durability and replication. Understanding WAL = understanding write costs.

You now have the mental model to understand why Postgres behaves the way it does. Module 1 goes one level deeper — into the physical page layout where your data actually lives.

Next: Module 1 — The Storage Engine: Pages, Heaps, and the True Cost of a Row →

PreviousModule P-10: Zero-Downtime Schema Migrations Next Module 1: The Storage Engine: Pages, Heaps, and the True Cost of a Row