Back/Module 1 The Storage Engine: Pages, Heaps, and the True Cost of a Row
Module 1·34 min read

Where your data actually lives and why every abstraction above this layer has a cost.

Module 1 — The Storage Engine: Pages, Heaps, and the True Cost of a Row

What this module covers: You will leave this module able to answer questions that most senior engineers cannot: Why does an UPDATE on a single column grow a table? Why does SELECT * on a freshly-loaded 500GB table miss the index? Why does storing a JSONB column with 10KB documents quietly make every query 3x slower? These are not configuration problems. They are storage engine problems — and they all trace back to what you are about to learn.


The 8KB Page: The Fundamental Unit of Everything

Postgres does not read individual rows from disk. It does not even read individual columns. Every I/O operation — whether you are fetching one row or one million — happens in units of 8KB pages.

This has a consequence that surprises most engineers: a query that touches a single 100-byte row reads 8,192 bytes from disk. The other 8,092 bytes come along for free. Understanding what is in those 8,092 bytes is the foundation of storage engine literacy.

Anatomy of a page

Every 8KB page in a Postgres heap file has the same structure:

┌─────────────────────────────────────────────────────────┐
│ PageHeaderData         (24 bytes)                        │
├─────────────────────────────────────────────────────────┤
│ ItemId array           (4 bytes × number of items)       │
├─────────────────────────────────────────────────────────┤
│                                                          │
│         Free Space                                       │
│                                                          │
├─────────────────────────────────────────────────────────┤
│ Tuples (heap items)    (variable, grows downward)        │
└─────────────────────────────────────────────────────────┘

PageHeaderData is 24 bytes containing:

  • pd_lsn — the Log Sequence Number of the last WAL record that modified this page. Used for crash recovery.
  • pd_checksum — checksum of the page contents (if data_checksums is enabled at cluster init time).
  • pd_flags — whether the page has free space, all-visible (all tuples are visible to all transactions), all-frozen.
  • pd_lower — byte offset to the end of the ItemId array.
  • pd_upper — byte offset to the start of the tuple data (which grows from the bottom).
  • pd_special — byte offset to the start of special space (used by index pages, always equal to page size for heap pages).

The free space in a page is pd_upper - pd_lower. When you insert a row, Postgres:

  1. Writes the tuple data starting at pd_upper and decrements pd_upper.
  2. Adds an ItemId entry to the ItemId array and increments pd_lower.

When pd_lower >= pd_upper, the page is full.

Inspecting a real page

Install pageinspect and examine your transactions table:

sql
CREATE EXTENSION pageinspect; -- Look at the header of page 0 of the transactions heap SELECT lsn, checksum, flags, lower, upper, special, pagesize, version, prune_xid FROM page_header(get_raw_page('transactions', 0));
lsn          | 0/3FA10B8
checksum     | 0
flags        | 0
lower        | 160
upper        | 2104
special      | 8192
pagesize     | 8192
version      | 4
prune_xid    | 0

Free space: upper - lower = 2104 - 160 = 1944 bytes. This page has roughly 1.9KB of free space remaining.

To see the individual item pointers:

sql
SELECT lp AS item_pointer, lp_off AS offset, lp_flags AS flags, lp_len AS length FROM heap_page_items(get_raw_page('transactions', 0)) LIMIT 10;
 item_pointer | offset | flags | length
--------------+--------+-------+--------
            1 |   7960 |     1 |    232
            2 |   7728 |     1 |    232
            3 |   7496 |     1 |    232
            4 |   7264 |     1 |    232
            5 |   7032 |     1 |    232

Each row on this page is 232 bytes. Note the offsets count backward from 8192 — tuples are stored from the bottom of the page upward.


The Heap File Layout: Where Your Table Actually Lives

A Postgres table is stored in a directory named by the database OID and consists of one or more files named by the table's OID (called a relfilenode).

sql
-- Find where your table is stored on disk SELECT pg_relation_filepath('transactions'); -- Returns: base/16384/24601

The file base/16384/24601 is the main fork — the heap data. But it is not the only file associated with your table.

The four forks

Every table has up to four file forks:

ForkFilename suffixContents
Main(none)The actual row data
FSM_fsmFree Space Map
VM_vmVisibility Map
Init_initUnlogged table init fork (empty)

Run ls -la on your data directory to see them:

bash
ls -la $PGDATA/base/16384/24601* # -rw------- 1 postgres 1073741824 24601 (1GB segment of heap data) # -rw------- 1 postgres 245760 24601_fsm (Free Space Map) # -rw------- 1 postgres 24576 24601_vm (Visibility Map)

File segments

Postgres caps each segment at 1GB by default. When your table exceeds 1GB, a second segment is created: 24601.1, then 24601.2, and so on. This means a 10TB table is stored across ~10,000 files.

This has implications for filesystem limits: if you configure your filesystem with an insufficient inode count and have many large tables, you can hit inode exhaustion even when disk space is available.


How a Row Is Physically Encoded

Understanding the wire format of a heap tuple explains a surprising amount of Postgres behavior.

HeapTupleHeaderData

Every row on disk starts with a HeapTupleHeaderData — a fixed overhead before the actual column values:

c
typedef struct HeapTupleHeaderData { /* Transaction visibility fields */ TransactionId t_xmin; /* 4 bytes: inserting transaction XID */ TransactionId t_xmax; /* 4 bytes: deleting/locking transaction XID */ union { CommandId t_cid; /* 4 bytes: command counter */ TransactionId t_xvac; /* VACUUM FULL moved me */ } t_field3; ItemPointerData t_ctid; /* 6 bytes: self or newer version pointer */ /* Misc flags */ uint16 t_infomask2; /* 2 bytes: number of attributes + flags */ uint16 t_infomask; /* 2 bytes: MVCC flags */ uint8 t_hoff; /* 1 byte: offset to user data */ /* NULL bitmap follows for rows with nullable columns */ /* Then the actual column values */ } HeapTupleHeaderData;

The minimum header size is 23 bytes. If the table has nullable columns, a null bitmap is appended (1 bit per column, rounded up to a byte boundary).

Key fields to understand:

  • t_xmin — the XID of the transaction that inserted this row version. Used by MVCC to determine visibility.
  • t_xmax — the XID of the transaction that deleted or locked this row. Zero for live, undeleted rows. Non-zero for rows deleted by UPDATE or DELETE.
  • t_ctid — a self-pointer for the current version. For a live row, this points to itself. After an UPDATE, the old version's t_ctid points to the new version (forming a chain). This chain is what MVCC uses to find the current version.

Column alignment and padding

This is the detail that surprises engineers the most.

Postgres aligns each column value to its natural alignment requirement. An 8-byte type (bigint, float8, timestamp) must start at a byte offset that is a multiple of 8. A 4-byte type (int, float4) must align to 4. A 2-byte type (smallint) to 2. Variable-length types (text, varchar, bytea) align to 4.

When column ordering causes misalignment, Postgres inserts padding bytes — wasted space — between columns.

Consider this schema:

sql
CREATE TABLE alignment_demo ( a smallint, -- 2 bytes, aligns to 2 b bigint, -- 8 bytes, aligns to 8 → needs 6 bytes of padding after 'a' c int -- 4 bytes, aligns to 4 );

The physical row layout:

| a (2B) | pad (6B) | b (8B) | c (4B) | = 20 bytes of data

Now reorder the columns:

sql
CREATE TABLE alignment_demo_opt ( b bigint, -- 8 bytes, aligns to 8 → starts at 0, no padding c int, -- 4 bytes, aligns to 4 → starts at 8, no padding a smallint -- 2 bytes, aligns to 2 → starts at 12, no padding );
| b (8B) | c (4B) | a (2B) | = 14 bytes of data

Same data, 30% smaller row. On a billion-row table, this difference is real:

sql
-- Measure the effect SELECT pg_size_pretty(pg_total_relation_size('alignment_demo')); SELECT pg_size_pretty(pg_total_relation_size('alignment_demo_opt'));

The rule: Order columns from largest natural alignment to smallest — bigint/timestamp/float8 first, then int/float4, then smallint, then text/varchar/bytea, then boolean.

Null bitmaps

If any column in a table is nullable (no NOT NULL constraint), Postgres stores a null bitmap in every row — even rows where no column is null.

The null bitmap occupies 1 bit per column, rounded up to the next byte. A table with 8 nullable columns has a 1-byte bitmap overhead per row. A table with 9–16 nullable columns has a 2-byte overhead. A table with 64 nullable columns has an 8-byte overhead.

For very wide tables with many nullable columns, the null bitmap becomes non-trivial. More importantly: if you declare all columns NOT NULL, the null bitmap is eliminated entirely.


TOAST: The Oversized Attribute Storage Technique

Postgres pages are 8KB. Many real-world values — JSON documents, text blobs, bytea fields — are larger than a page. This is where TOAST comes in.

The threshold and what happens

When a row is about to be inserted and its total size exceeds the TOAST threshold (roughly 2KB by default, 1/4 of a page), Postgres attempts to compress the oversized columns first. The compression algorithm used is pglz (Postgres-specific) or, since PG 14, optionally lz4.

If compression brings the row under the threshold, the compressed value is stored inline. If it does not, the value is moved out-of-line to the TOAST table — a separate heap file created automatically for every table with potentially-wide columns.

The TOAST table lives in the same database, with a name like pg_toast.pg_toast_24601. Each TOAST entry is chunked into 2KB pieces and stored with a chunk_id (the OID of the original value) and a chunk_seq number.

The original row stores only a small TOAST pointer — 18 bytes — referencing the TOAST table entry.

Why this destroys query performance silently

The silent performance impact of TOAST is one of the most under-discussed topics in Postgres performance tuning.

Scenario: You have a transactions table with a payload JSONB column. Most rows have small payloads (< 2KB). Some rows, for complex transactions with many events, have payloads of 20–50KB.

When you query for a single transaction:

sql
SELECT hash, sender, amount FROM transactions WHERE hash = $1;

Postgres executes this using an index on hash. It finds the row in the heap — fast. The hash, sender, and amount columns are inline. No TOAST access.

Now add payload to the SELECT:

sql
SELECT hash, sender, amount, payload FROM transactions WHERE hash = $1;

For large-payload rows, Postgres must now reassemble the TOAST chunks — reading multiple 2KB chunks from pg_toast.pg_toast_24601, stitching them back together, decompressing if necessary.

The problem compounds with SELECT *:

sql
-- This is always expensive if you have any TOAST columns EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM transactions WHERE block_height BETWEEN 18500000 AND 18500100;

For every row in the result that has a large payload, Postgres performs N additional buffer reads to reconstruct the TOAST data. A query that appears to scan 500 pages may actually read 5,000 buffer pages once TOAST is accounted for.

Practical TOAST strategies

1. Use SELECT with explicit columns, never SELECT * on tables with JSONB or text.

Even in application code, SELECT hash, sender, amount, status FROM transactions is categorically different from SELECT *. The former never touches the TOAST table for the missing columns.

2. Set STORAGE EXTERNAL to prevent compression but keep speed.

If you need fast TOAST retrieval but your data is already incompressible (binary data, pre-compressed payloads):

sql
ALTER TABLE transactions ALTER COLUMN payload SET STORAGE EXTERNAL;

EXTERNAL stores out-of-line without compression, allowing partial TOAST reads (Postgres can read just the first N bytes without decompressing the full value).

3. Set STORAGE PLAIN for columns you know will always be small.

sql
ALTER TABLE transactions ALTER COLUMN status SET STORAGE PLAIN;

PLAIN forces inline storage and disables compression. Use this for short varchar columns that Postgres might otherwise attempt to compress.

4. Monitor TOAST table size separately:

sql
SELECT c.relname AS main_table, t.relname AS toast_table, pg_size_pretty(pg_total_relation_size(c.oid)) AS main_size, pg_size_pretty(pg_total_relation_size(t.oid)) AS toast_size FROM pg_class c JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relname = 'transactions';

On a blockchain indexer, it is common for the TOAST table to be 2–5x larger than the main heap when events/payloads are stored as JSONB.


The Visibility Map: Why Index-Only Scans Fail on Fresh Tables

The Visibility Map (VM) is a compact bitfield — 1 bit per heap page — that records whether all rows on a page are visible to all current and future transactions.

When the VM bit for a page is set, Postgres knows it can return rows from an index-only scan without visiting the heap at all. The page is "clean" — there are no dead tuples, and all tuples are committed and visible.

When the VM bit is clear, Postgres must visit the heap to check tuple visibility, defeating the purpose of an index-only scan.

Why freshly loaded tables miss index-only scans

sql
-- Load 10 million rows COPY transactions FROM '/data/transactions.csv'; -- Run an index-only scan (should be fast) EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(hash) FROM transactions WHERE block_height BETWEEN 18000000 AND 18100000;

Expected plan: Index Only Scan. Actual plan: Index Scan (or worse, sequential scan).

The reason: COPY does not set VM bits. All VM bits for the newly-written pages are clear. Postgres must visit the heap to check visibility, making index-only scans impossible until after VACUUM runs and sets the VM bits.

sql
-- Fix: run VACUUM after bulk load VACUUM transactions; -- Now check if VM bits are set SELECT blkno, all_visible, all_frozen FROM pg_visibility('transactions') WHERE blkno < 10;

After VACUUM, the all_visible column should be true for pages with only committed rows. Index-only scans become available immediately.

This is why the post-bulk-load sequence should always be:

sql
COPY transactions FROM '/data/transactions.csv'; ANALYZE transactions; -- updates planner statistics VACUUM transactions; -- sets VM bits, enables index-only scans

Not just ANALYZE. The ANALYZE call updates statistics so the planner can choose better plans, but it does not set VM bits. Both are required.

The all-frozen bit

Each heap page has a second VM bit: all_frozen. When all tuples on a page have been frozen (their t_xmin replaced with FrozenTransactionId), this bit is set and VACUUM will never need to visit that page again during freeze operations.

For tables with billions of rows, the difference between 0% all-frozen and 100% all-frozen is the difference between a 6-hour weekly VACUUM FREEZE and a near-instant one.


The Free Space Map: How Postgres Finds Space for New Rows

When Postgres inserts a row, it needs to find a page with enough free space to accommodate the new tuple. Scanning every page of a large table to find one with free space would be catastrophically slow.

The Free Space Map (FSM) solves this. It is a compact tree structure that maps each page to an approximate free space bucket (0–255, representing 0% to 100% free space in steps of roughly 0.4%).

How Postgres uses the FSM

On insert, Postgres queries the FSM: "give me a page with at least N bytes free." The FSM returns a page number. Postgres reads that page, inserts the row, and updates the FSM entry.

The FSM is updated by VACUUM and by heap insert operations. It is not updated in real time on every delete or update — which means the FSM can temporarily overestimate free space (point to a page that has since been filled) or underestimate it (not yet know about space freed by VACUUM).

Inspecting the FSM

sql
CREATE EXTENSION pg_freespacemap; -- Average free bytes per page SELECT blkno, avail AS free_bytes, round(avail::numeric / 8192 * 100, 1) AS pct_free FROM pg_freespace('transactions') WHERE avail > 0 ORDER BY avail DESC LIMIT 20;
 blkno  | free_bytes | pct_free
--------+------------+----------
  24501 |       3456 |     42.2
  24502 |       3456 |     42.2
  24503 |       3456 |     42.2
  24504 |       3200 |     39.1

These are pages with significant free space — likely recently vacuumed pages where dead tuples were reclaimed.

Why VACUUM without ANALYZE leaves money on the table

VACUUM without ANALYZE:

  • Reclaims dead tuple space ✓
  • Updates the FSM ✓
  • Sets VM bits ✓
  • Updates planner statistics ✗

After a large DELETE followed by VACUUM (no ANALYZE), the planner still believes the table has the old row count. It may choose sequential scans because it thinks there are 100 million rows when there are now only 10 million. The correct command is VACUUM ANALYZE.


Production Example: Diagnosing a 3x Query Regression After Bulk Insert

The incident: A blockchain indexer processes a batch of 2 million historical blocks. Post-load, a frequently-used query — "get all transactions for a given sender in the last 30 days" — went from 120ms to 380ms. Nothing about the query or the schema changed.

Step 1: Identify what changed

sql
-- Check when statistics were last updated SELECT schemaname, tablename, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup, n_mod_since_analyze FROM pg_stat_user_tables WHERE tablename = 'transactions';
 last_analyze       | 2024-03-10 02:15:43
 last_autoanalyze   | NULL
 n_live_tup         | 48000000
 n_dead_tup         | 0
 n_mod_since_analyze| 2000000

Two million modifications since the last analyze. The statistics are stale.

Step 2: Check the query plan before and after

sql
EXPLAIN (ANALYZE, BUFFERS) SELECT hash, amount, timestamp FROM transactions WHERE sender = '0xabc123...' AND timestamp > now() - interval '30 days' ORDER BY timestamp DESC LIMIT 100;

Before ANALYZE:

Seq Scan on transactions  (cost=0.00..892441.00 rows=142 width=76)
                          (actual time=12.847..341.221 rows=89 loops=1)
  Filter: ((sender = '0xabc123...') AND (timestamp > ...))
  Rows Removed by Filter: 47999911
  Buffers: shared hit=183441 read=62891

The planner chose a sequential scan because its estimated row count for sender = '0xabc123...' (142 rows) was based on old statistics. With 2 million new rows inserted, the selectivity estimate was wrong.

After ANALYZE transactions:

Index Scan using idx_transactions_sender_timestamp on transactions
  (cost=0.57..891.33 rows=89 width=76) (actual time=0.312..4.811 rows=89 loops=1)
  Index Cond: ((sender = '0xabc123...') AND (timestamp > ...))
  Buffers: shared hit=94 read=1

Execution time: 4.8ms. The planner now has accurate statistics and chose the index correctly.

Step 3: Check VM bits for index-only scan eligibility

sql
-- How many pages have VM bits set? SELECT COUNT(*) FILTER (WHERE all_visible) AS visible_pages, COUNT(*) AS total_pages, round(COUNT(*) FILTER (WHERE all_visible)::numeric / COUNT(*) * 100, 1) AS pct_visible FROM pg_visibility('transactions');
 visible_pages | total_pages | pct_visible
---------------+-------------+-------------
        118204 |      131072 |        90.2

Only 90.2% of pages have their VM bit set. The newly-inserted pages (the last ~13,000 pages) are not yet all-visible. Any query that could use an index-only scan is forced to visit the heap for those pages.

sql
-- Fix: VACUUM sets VM bits for committed rows VACUUM transactions; -- Re-check SELECT COUNT(*) FILTER (WHERE all_visible) AS visible_pages, COUNT(*) AS total_pages FROM pg_visibility('transactions');
 visible_pages | total_pages
---------------+-------------
        131072 |      131072

Now 100% of pages are all-visible. Index-only scans on the transactions table are fully enabled.

The complete post-bulk-load runbook

sql
-- After any significant bulk insert, always run in this order: ANALYZE transactions; -- 1. Update statistics first (fast) VACUUM transactions; -- 2. Set VM bits, update FSM (slower) -- Verify the fix: SELECT last_analyze, n_mod_since_analyze, n_live_tup FROM pg_stat_user_tables WHERE tablename = 'transactions'; SELECT pct_visible FROM ( SELECT round( COUNT(*) FILTER (WHERE all_visible)::numeric / COUNT(*) * 100, 1 ) AS pct_visible FROM pg_visibility('transactions') ) t;

Expected: n_mod_since_analyze = 0, pct_visible = 100.0.


Summary

ConceptProduction Implication
8KB pagesEvery I/O reads a full page. Row width directly affects pages-per-table.
Column alignmentReorder columns (large to small) to eliminate padding. Real savings on wide tables.
TOASTNever use SELECT * on tables with JSONB/text. TOAST amplifies I/O silently.
Visibility MapRun VACUUM after bulk loads to enable index-only scans.
Free Space MapVACUUM without ANALYZE leaves planner statistics stale. Use VACUUM ANALYZE.
Null bitmapsDeclare NOT NULL wherever possible to eliminate null bitmap overhead and enable optimisations.

In Module 2, we go one level deeper — into the MVCC machinery that makes all concurrent access possible, why dead tuples accumulate regardless of your workload, and the exact conditions under which your 32-bit transaction ID clock will shut your database down.

Next: Module 2 — MVCC: The Architecture That Makes Concurrency Possible (and Expensive) →

Discussion