Back/Module 2 MVCC: The Architecture That Makes Concurrency Possible (and Expensive)
Module 2·32 min read

How an 18-hour idle-in-transaction session caused 17 GB of dead tuple bloat and 6 hours of replica lag — and the MVCC mechanics behind it.

Module 2 — MVCC: The Architecture That Makes Concurrency Possible (and Expensive)

What this module covers: Multi-Version Concurrency Control is the single most consequential design decision in Postgres. It enables readers and writers to coexist without locks — but every write creates a new row version, every delete leaves a corpse, and every update creates two corpses. Left unmanaged, MVCC accumulates waste until it shuts your database down entirely. This is not hypothetical. Transaction ID wraparound has forced emergency downtime at major companies. You need to understand this system from first principles.


The Problem MVCC Solves

In a locking database (like MySQL with SERIALIZABLE using range locks, or early PostgreSQL), a SELECT that reads a row must acquire a read lock, preventing concurrent writes. A UPDATE must acquire a write lock, preventing concurrent reads and writes.

At scale, this serialises all access to hot rows — a disaster for any high-throughput system.

MVCC solves this with a different model: instead of blocking reads, create a new version of every modified row. Readers see the old version. Writers create the new version. Nobody blocks.

The cost is space: you accumulate old versions. And there is a hard limit on how many transaction IDs you can use before the version visibility system breaks down completely.


Transaction IDs: The Clock That Drives Everything

Every transaction in Postgres is assigned a Transaction ID (XID) — a 32-bit unsigned integer that increments monotonically. Transaction IDs are used to determine which row versions are visible to which transactions.

sql
-- See your current transaction ID SELECT txid_current(); -- 4831728 -- See current transaction info SELECT * FROM txid_current_snapshot(); -- txid_current_snapshot -- --------------------- -- 4831725:4831728:4831725,4831726

The snapshot format is xmin:xmax:xip_list:

  • xmin — all transactions with XID < this are committed and visible
  • xmax — all transactions with XID >= this are in-progress or haven't started and are invisible
  • xip_list — transactions between xmin and xmax that are still in-progress

The 32-bit limit

The XID counter is 32 bits — it wraps around at 2^32 = ~4.3 billion transactions. Postgres uses a concept of "transaction ID age": from any current transaction, XID values in the "past" half of the 32-bit circle are considered committed, and values in the "future" half are considered in-progress.

This means Postgres can only "see back" ~2.1 billion transactions at any time. When a row's t_xmin is more than ~2.1 billion transactions old, Postgres can no longer determine its visibility status — it becomes invisible to everything.

This is transaction ID wraparound, and it will shut down your database.

We will cover the mechanics and prevention in depth later in this module.


Row Versioning: xmin, xmax, and the Visibility Chain

Every row on disk carries two XID fields in its HeapTupleHeaderData:

  • t_xmin — the XID of the transaction that created this row version
  • t_xmax — the XID of the transaction that deleted or updated this row version (0 if still live)

What happens on INSERT

sql
BEGIN; INSERT INTO transactions (hash, block_height, sender, amount, status, timestamp) VALUES (decode('deadbeef', 'hex'), 18500000, '0xabc', 100.0, 'confirmed', now()); COMMIT; -- XID = 4831729

The inserted row has:

  • t_xmin = 4831729 (the inserting transaction)
  • t_xmax = 0 (not yet deleted)

What happens on UPDATE

sql
BEGIN; -- XID = 4831730 UPDATE transactions SET status = 'failed' WHERE hash = decode('deadbeef', 'hex'); COMMIT;

Postgres does not modify the existing row. Instead:

  1. The old row gets t_xmax = 4831730 (marking it as deleted by this transaction)
  2. A new row is inserted with t_xmin = 4831730, t_xmax = 0, and the new status value
  3. The old row's t_ctid is updated to point to the new row's physical location

You now have two versions of the same logical row on disk:

Old version: t_xmin=4831729, t_xmax=4831730, status='confirmed'
New version: t_xmin=4831730, t_xmax=0,       status='failed'

The old version is now a dead tuple — no transaction started after 4831730 can see it, but it still occupies space.

What happens on DELETE

sql
BEGIN; -- XID = 4831731 DELETE FROM transactions WHERE hash = decode('deadbeef', 'hex'); COMMIT;

The row gets t_xmax = 4831731. No new row is created. The row becomes dead after all transactions that started before 4831731 complete.

Inspecting row versions live

sql
-- See xmin/xmax for rows in a page SELECT t_xmin, t_xmax, t_ctid, t_infomask, substring(t_data::text, 1, 50) AS data_preview FROM heap_page_items(get_raw_page('transactions', 0)) WHERE t_xmin IS NOT NULL LIMIT 5;
 t_xmin  | t_xmax  |  t_ctid  | t_infomask | data_preview
---------+---------+----------+------------+--------------
 4831729 |       0 | (0,1)    |       2306 | \xdeadbeef...
 4831730 |       0 | (0,2)    |       2306 | \xdeadbeef...

After an update, you see both versions on the page. The old version's t_xmax would be non-zero.


Snapshots: How Visibility Is Determined

When a transaction reads data, it uses a snapshot to determine which row versions are visible. The snapshot captures the state of all active transactions at a specific moment.

The snapshot algorithm

For a row version to be visible to a transaction with snapshot (xmin, xmax, xip_list):

The row was inserted by a visible transaction when:

  • t_xmin < snapshot.xmin (committed before snapshot was taken) AND t_xmin's transaction committed

OR:

  • snapshot.xmin <= t_xmin < snapshot.xmax AND t_xmin is NOT in xip_list (committed between snapshot xmin and xmax)

AND the row was not deleted by a visible transaction:

  • t_xmax = 0 (never deleted)
  • OR t_xmax is in xip_list (still in-progress — the delete hasn't committed)
  • OR t_xmax >= snapshot.xmax (started after the snapshot was taken)

This logic is implemented in HeapTupleSatisfiesMVCC() in the Postgres source code. Every row access in every query runs this function for every candidate row version.

The performance implication

MVCC visibility checking runs for every row the executor evaluates — not just the rows returned. On a sequential scan of 100 million rows that returns 1000, visibility is checked 100 million times. Each check reads t_xmin, t_xmax, and potentially queries the commit log.

For committed transactions (the common case), Postgres caches the commit status in the tuple's t_infomask bits after the first check — the "hint bit" optimisation. But for very old transactions or tables that haven't been vacuumed, these hint bits may not be set, forcing a commit log lookup on every visibility check.


Isolation Levels: Where the Snapshot Is Taken

The difference between READ COMMITTED and REPEATABLE READ is not a conceptual one about what data you see. It is a mechanical one about when the snapshot is taken.

READ COMMITTED

A new snapshot is taken at the start of each statement within the transaction.

sql
BEGIN; -- transaction T1, isolation = READ COMMITTED (default) SELECT COUNT(*) FROM transactions WHERE status = 'pending'; -- Snapshot taken: sees all rows committed before this statement -- Result: 1000 -- Meanwhile, T2 commits a DELETE that removes 500 pending rows SELECT COUNT(*) FROM transactions WHERE status = 'pending'; -- NEW snapshot taken: sees T2's committed delete -- Result: 500 ← different result! COMMIT;

This is not a bug — it is the defined behaviour of READ COMMITTED. Each statement gets a fresh view of committed data. This is why you can observe non-repeatable reads: the same query returns different results within the same transaction.

REPEATABLE READ

A snapshot is taken once at the start of the first statement in the transaction and reused for all subsequent statements.

sql
BEGIN ISOLATION LEVEL REPEATABLE READ; -- transaction T1 SELECT COUNT(*) FROM transactions WHERE status = 'pending'; -- Snapshot taken ONCE here: sees 1000 -- Meanwhile, T2 commits a DELETE that removes 500 pending rows SELECT COUNT(*) FROM transactions WHERE status = 'pending'; -- SAME snapshot reused: T2's changes are invisible -- Result: 1000 ← consistent with first read COMMIT;

A REPEATABLE READ transaction always sees the same snapshot. However, if T1 tries to update a row that T2 has already modified and committed, T1 will get a serialisation error:

ERROR:  could not serialize access due to concurrent update

This is not the same as a deadlock. It is MVCC enforcing that the transaction's view of the world is no longer consistent with the actual committed state.

READ COMMITTED vs REPEATABLE READ in practice

For a blockchain indexer reading aggregate statistics alongside writes, READ COMMITTED is usually correct — you want each query to see the latest committed state. For a financial report that must be consistent across multiple queries (e.g. balance sheet that must balance), REPEATABLE READ is essential.

sql
-- Financial report that must be consistent BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT SUM(amount) FROM transactions WHERE status = 'confirmed'; SELECT COUNT(*) FROM transactions WHERE status = 'confirmed'; -- Both queries see the same snapshot — the counts will be consistent COMMIT;

SERIALIZABLE and SSI

SERIALIZABLE goes one step further. It not only takes a consistent snapshot, it also tracks read-write dependencies between concurrent transactions to detect non-serialisable execution orders.

sql
-- Transaction T1 BEGIN ISOLATION LEVEL SERIALIZABLE; SELECT SUM(amount) FROM transactions WHERE sender = '0xabc'; -- Transaction T2 (concurrent) BEGIN ISOLATION LEVEL SERIALIZABLE; INSERT INTO transactions (sender, amount, ...) VALUES ('0xabc', 500, ...); COMMIT; -- T1 continues INSERT INTO transactions (sender, amount, ...) VALUES ('0xabc', -500, ...); -- ERROR: could not serialize access due to read/write dependencies among transactions

SSI detects that T1 read a value that T2 modified, creating a cycle in the dependency graph. One transaction must be aborted. The application must retry.

SSI is powerful but has a CPU overhead (predicate locking) and a retry burden. For most OLTP workloads, READ COMMITTED with careful application-level conflict handling is more practical.


The True Cost of MVCC: Dead Tuples and Heap Bloat

Every UPDATE and DELETE in Postgres leaves dead tuples on disk. This is not an edge case — it is the fundamental price of MVCC.

How bloat accumulates

Consider a table that receives 1,000 updates per second. Each update:

  1. Marks the old row version as dead (sets t_xmax)
  2. Inserts a new row version

After one hour: 3.6 million dead tuples. After one day: 86.4 million dead tuples.

Autovacuum reclaims this space, but only if it is tuned aggressively enough to keep up with the update rate. A misconfigured autovacuum on a high-write table will fall behind, and the table will grow without bound even if the number of live rows is constant.

Measuring bloat

sql
-- Using pgstattuple (requires pg_stat_extension) CREATE EXTENSION pgstattuple; SELECT table_len, tuple_count, tuple_len, tuple_percent, dead_tuple_count, dead_tuple_len, dead_tuple_percent, free_space, free_percent FROM pgstattuple('transactions');
 table_len       | 10737418240  (10 GB)
 tuple_count     | 48000000
 tuple_len       | 9663676416   (9.0 GB)
 tuple_percent   | 90.0
 dead_tuple_count| 2400000
 dead_tuple_len  | 483328000    (461 MB)
 dead_tuple_percent | 4.5
 free_space      | 547356672    (522 MB)
 free_percent    | 5.1

4.5% of this table is dead tuples. On a 10GB table, that is 461MB of wasted space that autovacuum has not yet reclaimed.

The long-running transaction problem

Dead tuples cannot be reclaimed by VACUUM while any active transaction's snapshot can still "see" them. Even a committed snapshot (one taken before the deleting transaction) prevents dead tuple reclamation.

This is why long-running transactions cause bloat even without performing any writes:

sql
-- In session 1: start a long-running read transaction BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT COUNT(*) FROM transactions; -- Don't commit for 2 hours -- In session 2: perform 10 million updates UPDATE transactions SET status = 'archived' WHERE timestamp < now() - interval '1 year'; -- Creates 10 million dead tuples -- VACUUM runs but cannot reclaim the dead tuples -- because session 1's snapshot still references them VACUUM transactions; -- WARNING: some dead tuples could not be removed due to recently active snapshots

To find the oldest transaction holding back vacuum:

sql
SELECT pid, usename, application_name, state, backend_xid, backend_xmin, now() - xact_start AS transaction_duration, query FROM pg_stat_activity WHERE backend_xmin IS NOT NULL AND now() - xact_start > interval '10 minutes' ORDER BY xact_start;

Any session with a very old backend_xmin is preventing vacuum from cleaning up dead tuples. If these sessions are idle-in-transaction, they need to be terminated.


Transaction ID Wraparound: The Ticking Clock

The 32-bit XID counter will eventually reach 2^32. When it does, newly created rows will appear to be in the "future" relative to older rows — a logical impossibility that breaks all visibility calculations.

Postgres prevents this catastrophe through transaction ID freezing: periodically replacing old t_xmin values with a special value FrozenTransactionId (XID 2). Frozen rows are always visible to all transactions, bypassing the visibility check entirely.

The age of a relation

sql
-- How "old" is each table in terms of transaction IDs? SELECT relname, age(relfrozenxid) AS xid_age, pg_size_pretty(pg_total_relation_size(oid)) AS size FROM pg_class WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 10;
     relname     | xid_age  |  size
-----------------+----------+--------
 transactions    | 1847263  | 10 GB
 blocks          |  934821  | 4 GB
 events          |  621044  | 2 GB

The xid_age is the number of transactions since the table's oldest unfrozen XID was created. When this exceeds autovacuum_freeze_max_age (default: 200 million), autovacuum is forced to run a freeze operation on the table.

The danger zone

Safe zone:      xid_age < 200,000,000
Warning zone:   xid_age > 500,000,000
Emergency zone: xid_age > 1,500,000,000
Shutdown:       xid_age > 2,100,000,000

When xid_age reaches ~1.6 billion (configurable via vacuum_freeze_max_age), Postgres enters a safe-mode where it will refuse all new connections except to allow a superuser to run VACUUM FREEZE.

sql
-- Monitor the danger level across all databases SELECT datname, age(datfrozenxid) AS database_age, round(age(datfrozenxid)::numeric / 2100000000 * 100, 1) AS pct_danger FROM pg_database ORDER BY age(datfrozenxid) DESC;

VACUUM FREEZE: the solution

sql
-- Force a freeze on a specific table (can be done online) VACUUM FREEZE transactions; -- Check remaining danger level SELECT relname, age(relfrozenxid), pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relname = 'transactions';

After VACUUM FREEZE, all tuples have t_xmin = FrozenTransactionId. The VM all-frozen bits are set. Future freeze vacuums skip these pages entirely.

Autovacuum freeze configuration

The key parameters:

sql
-- Per-table override for high-write tables ALTER TABLE transactions SET ( autovacuum_vacuum_cost_delay = 2, -- ms, lower = more aggressive autovacuum_vacuum_cost_limit = 400, -- higher = more work per vacuum pass autovacuum_freeze_max_age = 100000000, -- trigger freeze at 100M transactions autovacuum_multixact_freeze_max_age = 100000000 );

For your most critical tables, set autovacuum_freeze_max_age well below the global default of 200 million. This triggers more frequent freeze vacuums and keeps you far from the danger zone.


MVCC and Index Overhead

Indexes in Postgres do not store MVCC visibility information. Every index entry points to a heap tuple, and heap tuple visibility must be checked at query time.

This has two consequences:

1. Index scans are slower than they appear. For every row an index scan returns, Postgres must visit the heap to check visibility. The only exception is Index-Only Scans when the Visibility Map confirms the page is all-visible (covered in Module 1).

2. Dead index entries accumulate. When a row is deleted or updated (creating a dead heap tuple), the corresponding index entries are not immediately removed. They remain in the index until VACUUM cleans them up. On a heavily-updated table with many indexes, dead index entries can make indexes significantly larger than the live data.

sql
-- Check index bloat for a specific table SELECT indexrelid::regclass AS index_name, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size, idx_scan AS scans, idx_tup_read AS tuples_read, idx_tup_fetch AS tuples_fetched FROM pg_stat_user_indexes WHERE relid = 'transactions'::regclass ORDER BY pg_relation_size(indexrelid) DESC;

If tuples_read is much larger than tuples_fetched, the index is reading many dead entries that the heap then rejects — a sign the index needs vacuuming.


Production Example: The MVCC-Induced Bloat Incident

The situation: A blockchain indexer table starts at 8GB. After three months of production with no bloat issues, it grows to 28GB. pg_stat_user_tables shows only 50 million live rows — which should fit in about 10GB.

Step 1: Confirm bloat with pgstattuple

sql
SELECT dead_tuple_count, dead_tuple_percent, pg_size_pretty(dead_tuple_len::bigint) AS dead_size, free_percent FROM pgstattuple('transactions');
 dead_tuple_count | dead_tuple_percent | dead_size | free_percent
------------------+--------------------+-----------+-------------
        18500000  |               65.2 | 17 GB     | 3.1

65% of the table is dead tuples. 17GB of dead rows.

Step 2: Find what is preventing vacuum

sql
SELECT pid, usename, state, backend_xmin, now() - xact_start AS age, left(query, 80) AS query FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin;
  pid  | usename  | state             | backend_xmin |    age     
-------+----------+-------------------+--------------+------------
 23481 | analytics| idle in transaction| 4813245     | 18:24:09.8

An analytics session has been idle-in-transaction for 18 hours. Its snapshot (XID 4813245) predates all the dead tuples we want to reclaim. As long as this transaction is open, VACUUM cannot touch them.

Step 3: Terminate the blocking session

sql
-- First, check with the team that this is safe to terminate SELECT pg_terminate_backend(23481); -- Confirm it's gone SELECT pid FROM pg_stat_activity WHERE pid = 23481; -- (0 rows)

Step 4: Run VACUUM and verify

sql
VACUUM ANALYZE transactions; -- Check bloat again SELECT dead_tuple_count, dead_tuple_percent, pg_size_pretty(dead_tuple_len::bigint) AS dead_size FROM pgstattuple('transactions');
 dead_tuple_count | dead_tuple_percent | dead_size
------------------+--------------------+-----------
                0 |                0.0 | 0 bytes

But the table is still 28GB. VACUUM reclaims space within pages (marks it as available for reuse) but does not return pages to the OS. The table will shrink over time as new inserts reuse the free space.

If you need the space returned to the OS immediately:

sql
-- WARNING: Takes ACCESS EXCLUSIVE lock. Not suitable for production tables. -- Consider pg_repack instead (covered in Module 4) VACUUM FULL transactions;

Step 5: Prevent recurrence with idle-in-transaction timeouts

sql
-- Set a global timeout for idle-in-transaction sessions ALTER SYSTEM SET idle_in_transaction_session_timeout = '15min'; SELECT pg_reload_conf(); -- Or per role for analytics users ALTER ROLE analytics SET idle_in_transaction_session_timeout = '5min'; -- Verify SHOW idle_in_transaction_session_timeout;

With this setting, any session that remains idle in transaction for more than 15 minutes is automatically terminated. This is the most important MVCC-related configuration for any production database that has analytics or reporting queries running alongside writes.


Summary

ConceptKey MechanicProduction Implication
XID32-bit counterMonitor age; freeze before wraparound
t_xmin / t_xmaxRow version timestampsEvery UPDATE creates dead tuples
SnapshotTaken at statement or transaction startIsolation level determines when
READ COMMITTEDNew snapshot per statementCan see non-repeatable reads
REPEATABLE READSnapshot at first statementConsistent reads, serialisation errors
Dead tuplesAccumulate on every UPDATE/DELETEAutovacuum must keep up
Long transactionsHold back VACUUMSet idle_in_transaction_session_timeout
XID wraparound2.1B transaction limitMonitor age(relfrozenxid)
FreezeSets t_xmin = FrozenXIDEnables VM all-frozen, safe from wraparound

In Module 3, we follow writes from application code all the way to durable storage — through the Write-Ahead Log. WAL is what makes MVCC crash-safe, what makes replication possible, and what explains your write amplification.

Next: Module 3 — Write-Ahead Logging: Durability, Replication, and the Price of Every Write →

Discussion