Module 10·40 min read

Replication is not a backup. Understanding the consistency guarantees of each replication mode is non-negotiable.

Module 10 — Replication, High Availability, and the CAP Trade-offs

What this module covers: Replication is not a backup. High availability is not the same as durability. Failover is not free. This module covers the full replication stack in Postgres — how WAL streaming works at the protocol level, exactly what each synchronous_commit mode guarantees, how failover happens and what it costs, how to build an HA cluster that survives real failures, and how CAP theorem applies concretely to every Postgres HA configuration you will encounter in production.

Replication Is Not a Backup

This is the first thing to establish because conflating them causes real disasters.

Replication propagates changes from primary to standby in near-real-time. If you DROP TABLE transactions on the primary, that statement replicates to every standby within milliseconds. All copies lose the table simultaneously.

Backup is a point-in-time snapshot that is not affected by live changes. A backup from yesterday still has the table after you accidentally drop it today.

You need both. They solve different problems:

Replication solves availability — keeping a hot standby ready to take over if the primary fails
Backup solves recoverability — restoring to a known good state after data corruption, accidental deletion, or application bugs

With that established: here is how replication works.

Physical Replication: The Full Protocol

Physical (streaming) replication was introduced in Postgres 9.0 and is the foundation of all Postgres HA setups. Every byte on the standby is an exact copy of the primary — same heap files, same indexes, same system catalog.

The Connection Sequence

The standby's WAL receiver process connects to the primary using a replication connection (a special connection type — replication=true in the connection string)
The primary's WAL sender process handles this connection
They negotiate: the standby sends its current replay LSN, the primary starts streaming WAL from that point
WAL records flow continuously from primary to standby

sql
-- On the primary: view active WAL sender connections
SELECT
  pid,
  application_name,
  client_addr,
  state,           -- 'streaming', 'catchup', 'backup', 'stopping'
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  sync_state       -- 'async', 'sync', 'potential', 'quorum'
FROM pg_stat_replication;

The four LSN columns track where the standby is in the pipeline:

sent_lsn — WAL sent by the primary's WAL sender
write_lsn — WAL written to standby's disk (received but not fsynced)
flush_lsn — WAL fsynced to standby's disk (durable on standby)
replay_lsn — WAL applied to standby's heap files (visible to queries on standby)

The gap between each adjacent pair is a different form of lag. A healthy streaming standby has all four within a few KB of each other.

The Standby's Startup Process

On the standby, the startup process runs in continuous recovery mode (Module 3). It reads WAL from pg_wal/ (written by the WAL receiver) and applies each record to the heap, exactly as it would during crash recovery — except it never stops. New WAL keeps arriving and being applied.

This is why a standby cannot accept write queries: it is perpetually replaying, and accepting writes would conflict with the recovery process. It can accept read queries (with hot_standby = on) because reads only need visibility snapshots, which the startup process maintains.

`hot_standby_feedback`

When a query runs on the standby, it holds an MVCC snapshot. If the primary vacuums and freezes tuples that the standby's query still needs, the standby query will get:

ERROR: could not serialize access due to concurrent update

hot_standby_feedback prevents this by having the standby report its oldest active transaction XID to the primary. The primary's autovacuum respects this and does not reclaim tuples that are still needed by standby queries.

ini
# On standby
hot_standby_feedback = on   # prevents cancellation of standby queries by vacuum

The cost: hot_standby_feedback = on can delay autovacuum on the primary if long-running standby queries hold old snapshots. This is the same dead tuple accumulation problem as long-running transactions on the primary (Module 4) — except it originates from the standby.

Synchronous Replication Modes: Exact Semantics

Module 3 introduced the synchronous_commit settings. Here is the full precision of what each one guarantees and costs.

The Five Modes

ini
synchronous_standby_names = 'standby1'  # required for sync replication

synchronous_commit = on (default)

Guarantee: WAL flushed to primary disk before COMMIT returns
What it does NOT guarantee: standby has received or applied the transaction
If primary crashes and is unrecoverable: promote standby → standby may be missing recent transactions
Latency added: none beyond local fsync (~1ms on SSD)

synchronous_commit = remote_write

Guarantee: WAL written (not fsynced) to standby's OS buffer before COMMIT returns
What it does NOT guarantee: standby's data survives a standby crash (OS buffer not yet flushed)
If standby crashes immediately after: up to one fsync cycle of data at risk on standby
Latency added: network RTT to standby (~1ms LAN, ~5ms same-region cloud)

synchronous_commit = remote_apply

Guarantee: transaction applied and visible on standby before COMMIT returns
What it does NOT guarantee: nothing — this is the strongest mode
Read-after-write from standby is consistent: a client that committed on primary can immediately query standby and see the committed data
Latency added: network RTT + standby apply time (~2–10ms)

synchronous_commit = local

Guarantee: WAL flushed to primary disk; standby is completely async regardless of synchronous_standby_names
Use: when you have a sync standby configured but specific transactions should not wait for it
Latency added: same as on (local fsync only)

synchronous_commit = off

Guarantee: COMMIT returns immediately; WAL flushed within wal_writer_delay (200ms default)
Risk: up to 200ms of committed transactions lost on crash
Latency added: none — fastest possible commit

Quorum Commit (Postgres 10+)

ini
# COMMIT waits for any 2 of 3 standbys to confirm
synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'

# COMMIT waits for ALL listed standbys
synchronous_standby_names = 'FIRST 2 (standby1, standby2)'
# or equivalently
synchronous_standby_names = 'standby1, standby2'

Quorum commit is the production-correct setup for high availability: the primary does not stall if one standby is down or lagging, as long as the quorum is met. With ANY 2 (s1, s2, s3): one standby can be down for maintenance without affecting commit latency.

Replication Slots: Guarantees and Risks

A replication slot guarantees that the primary retains WAL until the standby confirms it has consumed everything up to that point.

sql
-- Create a physical replication slot for a standby
SELECT pg_create_physical_replication_slot('standby1_slot');

-- View slot state
SELECT
  slot_name,
  slot_type,
  active,
  restart_lsn,
  confirmed_flush_lsn,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;

The WAL Accumulation Risk

As covered in Module 4: a replication slot that becomes inactive (standby goes down, network partition) retains all WAL from its restart_lsn forward. This accumulation is unbounded — Postgres will fill your disk before dropping WAL that a slot needs.

ini
# Cap WAL retention per slot (Postgres 13+)
max_slot_wal_keep_size = 10GB  # slot is invalidated if WAL would exceed this

When a slot is invalidated due to max_slot_wal_keep_size, the standby must perform a new base backup to reconnect — it cannot resume streaming from where it left off.

Standbys Without Slots

A standby can connect without a replication slot. In this case:

The primary does not retain WAL specifically for this standby
If the standby falls too far behind (beyond wal_keep_size), the standby gets:
```
ERROR: requested WAL segment has already been removed
```
The standby must take a new base backup

The trade-off: slots guarantee WAL retention (risk: disk fill). No slot provides no guarantee but no disk fill risk.

For production: use slots with max_slot_wal_keep_size set. The worst case (slot invalidated) requires a base backup. The alternative worst case (no slot, disk full) requires emergency intervention on the primary.

Point-in-Time Recovery (PITR)

PITR allows restoring a Postgres cluster to any point in time, not just the latest backup. It requires:

A base backup (full physical copy of $PGDATA)
Continuous WAL archiving from the backup point to the target recovery time

ini
# WAL archiving configuration
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'
archive_timeout = 300  # force WAL segment switch after 5 minutes of inactivity

Recovery Target Modes

ini
# Recover to a specific timestamp
recovery_target_time = '2026-05-17 14:30:00 UTC'

# Recover to a specific LSN
recovery_target_lsn = '2/4F3A1820'

# Recover to just before a specific transaction ID
recovery_target_xid = '504312847'

# Recover to a named restore point (created with pg_create_restore_point())
recovery_target_name = 'before_bad_migration'

# What to do after reaching the target
recovery_target_action = 'promote'  # become a primary
recovery_target_action = 'pause'    # pause for inspection before promoting
recovery_target_action = 'shutdown' # stop (for manual inspection)

The PITR Workflow

bash
# 1. Take a base backup (while primary is running)
pg_basebackup -h primary -D /var/lib/postgresql/recovery -P -Xs -R

# 2. Configure recovery target in postgresql.conf on the recovery instance
echo "recovery_target_time = '2026-05-17 14:30:00 UTC'" >> /var/lib/postgresql/recovery/postgresql.conf
echo "recovery_target_action = 'pause'" >> /var/lib/postgresql/recovery/postgresql.conf
echo "restore_command = 'cp /mnt/wal_archive/%f %p'" >> /var/lib/postgresql/recovery/postgresql.conf

# 3. Start the recovery instance — it replays WAL until the target
pg_ctl start -D /var/lib/postgresql/recovery

# 4. Verify the recovered state
psql -c "SELECT count(*) FROM transactions;"  # check table exists and has expected data

# 5. If satisfied, promote to writable
psql -c "SELECT pg_promote();"

pg_basebackup -R creates standby.signal and writes the connection info to postgresql.auto.conf, making the recovery instance ready to connect as a standby — useful for "restore and promote" workflows.

Failover: What Actually Happens

Failover — promoting a standby to primary — is conceptually simple but operationally complex. Here is the precise sequence of events.

Manual Failover

bash
# On the standby — promote to primary
pg_ctl promote -D /var/lib/postgresql/data

# Or via SQL (Postgres 12+)
psql -c "SELECT pg_promote();"

When promoted:

The startup process stops waiting for WAL from the old primary
A new timeline is started (timeline ID increments by 1)
A timeline history file is written to pg_wal/ recording when the timeline changed
The standby becomes a writable primary on the new timeline

The application must be updated to point to the new primary's address. This is why automatic failover tooling exists — manual failover with application reconfiguration is too slow for most production SLAs.

What Happens to In-Flight Transactions

At the moment of failover, there are potentially two classes of transactions on the old primary:

Committed but not yet replicated (async mode): These transactions are committed on the old primary's WAL but the standby has not received them. After promotion, these transactions are permanently lost. Clients received COMMIT acknowledgement for data that no longer exists.

This is why synchronous_commit = remote_apply or at minimum remote_write is required for financial workloads — it ensures the standby has the data before the client is told the commit succeeded.

In-progress at failover: These transactions were not committed. The new primary's MVCC state correctly shows them as aborted (their xmin is not in the committed set). No data loss, no inconsistency.

Timeline Divergence

After promotion, the new primary writes new WAL on timeline 2. If the old primary comes back online (e.g., it was a network partition, not a hardware failure), it has WAL on timeline 1 that the new primary does not have — split brain.

Postgres handles this via timeline history files. A standby connecting to the new primary will detect the timeline switch and refuse to replay WAL that conflicts with the new timeline. Properly configured tooling ensures the old primary is fenced (prevented from accepting writes) before the standby is promoted.

Automatic Failover: Patroni

Patroni is the standard open-source HA solution for Postgres. It runs alongside each Postgres instance and uses a distributed consensus store (etcd, Consul, or ZooKeeper) to coordinate leader election.

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Primary    │────▶│   etcd/      │◀────│  Standby 1  │
│  Patroni    │     │   Consul     │     │  Patroni    │
│  agent      │     │   cluster    │     │  agent      │
└─────────────┘     └──────────────┘     └─────────────┘
       │                                        │
       ▼                                        ▼
  PostgreSQL                              PostgreSQL
  (primary)                               (standby)

Each Patroni agent:

Holds a leader lease in etcd (renewed every ttl seconds, default 30s)
Monitors the local Postgres instance health
If the primary's agent fails to renew its lease → other agents race to acquire it → winner promotes its standby

The Fencing Problem

Before promoting, Patroni must ensure the old primary is not still running and accepting writes. This is done via:

pg_ctl stop -m fast — Patroni attempts to stop the old primary before promoting
Watchdog — a kernel module (softdog or hardware watchdog) that reboots the primary node if Patroni fails to stop Postgres
STONITH (Shoot The Other Node In The Head) — cloud API calls to forcibly terminate the instance

Without fencing, split-brain is possible: both nodes accept writes, data diverges, and you have a corrupt cluster.

Patroni Configuration Snippet

yaml
# patroni.yml
scope: postgres-cluster
name: primary-node

etcd3:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30                    # leader lease duration
    loop_wait: 10              # health check interval
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — don't promote if too far behind

  postgresql:
    parameters:
      synchronous_commit: "on"
      synchronous_standby_names: "ANY 1 (standby1,standby2)"

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replicatorpass

watchdog:
  mode: required   # Patroni refuses to start without watchdog
  device: /dev/watchdog

maximum_lag_on_failover is critical: it prevents promoting a standby that is far behind the primary from becoming the new primary with stale data. A standby more than 1MB behind will not be promoted — Patroni waits for it to catch up or picks another standby.

Read Replicas: What Is Safe to Route There

A standby in hot_standby mode accepts read queries. But read replicas have consistency properties that must be understood before routing traffic to them.

The Read-After-Write Problem

Timeline:
  t=0: Client writes row to primary (COMMIT succeeds)
  t=1: Client reads from standby
  t=2: Standby applies the write (apply_lag = 2 units)

If apply_lag > 0, a client that just wrote to the primary may not see its own write when reading from the standby. For most web applications, this is an invisible UX problem: a user updates their profile, is redirected to the profile page, and sees the old version.

Solutions:

1. Route writes and immediately-following reads to primary

// Application routing logic
if (request.justWrote) {
  db = primaryConnection;
} else {
  db = replicaConnection;
}

2. Use synchronous_commit = remote_apply for writes that must be immediately readable on standbys. The commit does not return until the standby has applied it — guaranteed read-after-write.

3. Read your own writes via session pinning — route all requests in a session to the same backend until a delay expires.

Queries That Are Always Unsafe on Standbys

sql
-- These fail or return wrong results on a hot standby:

-- 1. Any write — obviously
INSERT INTO transactions ...;
-- ERROR: cannot execute INSERT in a read-only transaction

-- 2. Advisory locks (used for distributed coordination)
SELECT pg_advisory_lock(12345);
-- May succeed but the lock exists only on the standby — not coordinated with primary

-- 3. Sequences (nextval) — return values from the standby's local state
SELECT nextval('transactions_id_seq');
-- Returns a value that may collide with primary's sequence

-- 4. Functions with VOLATILE that write side-effects
-- Check function volatility before using on standby

The Standby Query Cancellation Problem

Long-running queries on standbys can be cancelled by the recovery process when it needs to apply a conflicting WAL record (e.g., a vacuum that reclaims pages the standby query is reading).

ini
# How long to wait before cancelling standby queries for recovery
max_standby_streaming_delay = 30s  # default; increase for long analytical queries
max_standby_archive_delay = 30s    # same, for standbys recovering from archive

With hot_standby_feedback = on (covered above), the primary's autovacuum avoids reclaiming tuples still needed by standby queries — but this delays autovacuum on the primary.

For analytics replicas running long queries: set max_standby_streaming_delay = -1 (never cancel) and accept that the standby may fall behind during heavy analytical workloads.

Logical Replication

Physical replication copies every byte. Logical replication copies row-level changes decoded from WAL — it is table-selective, schema-independent, and can replicate to a different Postgres version or even a different database system.

sql
-- On the publisher (primary)
CREATE PUBLICATION blockchain_pub FOR TABLE transactions, blocks;

-- On the subscriber (another Postgres instance)
CREATE SUBSCRIPTION blockchain_sub
  CONNECTION 'host=primary dbname=indexer user=replicator'
  PUBLICATION blockchain_pub;

When Logical Replication Is the Right Tool

Zero-downtime major version upgrades: set up logical replication to a new-version standby, let it catch up, switch application over
Selective replication: replicate only specific tables to a reporting database
Cross-database replication: replicate between different databases on the same or different servers
CDC consumers: Debezium, custom consumers via pg_logical or the replication protocol

The Logical Replication Write Cost

wal_level = logical (required for logical replication) writes more WAL than replica — it includes full column values for decoded change events. On write-heavy tables, this can be 10–30% more WAL volume.

sql
-- Monitor logical replication lag
SELECT
  subname,
  received_lsn,
  latest_end_lsn,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), received_lsn)
  ) AS lag
FROM pg_stat_subscription;

CAP Theorem Applied to Postgres HA

CAP theorem states that a distributed system can guarantee at most two of: Consistency, Availability, Partition tolerance.

All production systems must tolerate network partitions (P). The real choice is between consistency (C) and availability (A) during a partition.

Postgres Default Configuration: CP

In the default async configuration with automatic failover (Patroni):

During normal operation: consistent reads and writes, highly available.

During a network partition:

If the primary cannot reach etcd, it cannot renew its lease → Patroni demotes it (stops accepting writes) rather than risk split-brain
The cluster is unavailable for writes until the partition heals or a standby is promoted
C is preserved (no split-brain), A is sacrificed (writes blocked)

This is Postgres's default choice: consistency over availability during partitions.

Synchronous Replication: Stronger C, Weaker A

With synchronous_commit = remote_apply and synchronous_standby_names = 'ANY 1 (s1, s2)':

A commit only succeeds if at least one standby confirms it
If all standbys become unreachable: primary blocks on all commits until they reconnect
Stronger durability guarantee (no data loss on failover) at the cost of availability during standby outages

Async Replication: Weaker C, Stronger A

With synchronous_commit = off and no synchronous_standby_names:

Commits succeed regardless of standby state
If primary fails and standby is promoted: committed transactions in the last 200ms may be lost
Available during partial outages, but weaker durability

The Practical Matrix

Configuration	Durability on Failover	Write Availability	Commit Latency
Async replication	Possible data loss (200ms window)	High	Lowest
`remote_write`	Possible loss if standby crashes	High	Low (+RTT)
`remote_apply`	No loss	Medium	Medium (+RTT+apply)
`remote_apply` + quorum `ANY 2/3`	No loss	High (tolerates 1 standby down)	Medium
All sync, `FIRST N`	No loss	Low (all standbys must be up)	Highest

For most production systems: quorum remote_apply with 2+ standbys is the correct balance. No data loss on failover, tolerates one standby failure without blocking writes.

Monitoring: The Queries That Matter

sql
-- 1. Replication lag in bytes and time (primary)
SELECT
  application_name,
  sync_state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))  AS send_lag,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn))             AS flush_lag,
  pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn))           AS apply_lag,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

-- 2. Current lag in seconds (standby)
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag,
  pg_is_in_recovery()                     AS is_standby,
  pg_last_wal_receive_lsn()               AS received_lsn,
  pg_last_wal_replay_lsn()                AS replayed_lsn;

-- 3. Slots that are accumulating WAL (primary)
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

-- 4. Conflicts on standby
SELECT
  datname,
  confl_tablespace,
  confl_lock,
  confl_snapshot,  -- standby query cancelled because primary needed to vacuum
  confl_bufferpin,
  confl_deadlock
FROM pg_stat_database_conflicts;
-- confl_snapshot > 0 = queries being cancelled; tune hot_standby_feedback or max_standby_streaming_delay

The Production Incident: Split-Brain From an Incomplete Fence

Context: A blockchain indexer with Patroni + etcd, primary and two standbys across two availability zones.

What happened:

A network partition isolated the primary from both etcd and the standbys. From the primary's perspective, etcd was unreachable — it could not renew its leader lease. Patroni's configured behavior: demote the primary by stopping Postgres after ttl seconds.

However, the watchdog device (/dev/watchdog) was not properly configured — watchdog_mode was set to automatic instead of required. When Patroni tried to use the watchdog for fencing, it silently skipped it (automatic mode treats watchdog as optional).

Patroni stopped Postgres on the primary via pg_ctl stop. But the stop command timed out (a long-running VACUUM was holding the shutdown). Patroni logged the timeout and proceeded to exit — but Postgres was still running.

Meanwhile, etcd detected the expired lease. One standby won the election and promoted itself.

For approximately 90 seconds, both nodes were accepting writes:

The old primary: still running, accepting connections from applications that had not yet received the DNS failover
The new primary: accepting writes from applications that had received the DNS failover

Both wrote conflicting data to the transactions table. After the network partition healed, the conflict was discovered during reconciliation — rows with duplicate block_height + sender combinations existed in different timelines.

The resolution:

The old primary's timeline diverged at LSN 3/A2F1C820. Every write after that point on the old primary was discarded. The new primary's timeline was authoritative. 90 seconds of writes on the old primary — approximately 180,000 transactions at 2,000 TPS — were permanently lost.

The fixes:

yaml
# 1. Require watchdog — refuse to run without it
watchdog:
  mode: required    # was: automatic
  device: /dev/watchdog

# 2. Set a hard timeout on pg_ctl stop
postgresql:
  pg_ctl_timeout: 10   # seconds; don't wait forever for a clean shutdown

# 3. Add a pre-promote callback to double-check the old primary is truly down
postgresql:
  callbacks:
    on_start: /etc/patroni/callbacks/on_start.sh
    on_stop: /etc/patroni/callbacks/on_stop.sh
    on_role_change: /etc/patroni/callbacks/fence_old_primary.sh

bash
# fence_old_primary.sh — STONITH via cloud API before promoting
#!/bin/bash
OLD_PRIMARY_INSTANCE_ID=$(cat /etc/patroni/old_primary_id)
aws ec2 stop-instances --instance-ids $OLD_PRIMARY_INSTANCE_ID
sleep 5
aws ec2 describe-instances --instance-ids $OLD_PRIMARY_INSTANCE_ID \
  --query 'Reservations[].Instances[].State.Name' \
  --output text | grep -q "stopped" || exit 1

The lessons:

watchdog_mode = required — never automatic in production
pg_ctl_timeout prevents indefinite hanging during failover
STONITH via cloud API is the reliable fence for cloud deployments
Test your failover. Regularly. Chaos engineering for databases is not optional.

Summary

Concept	Key Takeaway
Replication vs backup	Replication propagates deletes and corruption. Backup provides a restore point. You need both.
`synchronous_commit`	Five modes from `off` (no durability) to `remote_apply` (standby must apply before commit returns)
Replication slots	Guarantee WAL retention; inactive slots fill disk. Set `max_slot_wal_keep_size`.
PITR	Requires base backup + continuous WAL archive. Test it quarterly.
Failover	In-progress transactions are safe. Committed-but-not-replicated transactions are lost in async mode.
Patroni	Standard HA tooling. `watchdog_mode = required`. STONITH is mandatory for split-brain prevention.
Read replicas	Apply lag means reads may be stale. `hot_standby_feedback` prevents query cancellation but delays autovacuum.
Logical replication	Table-selective, version-independent. Required for zero-downtime major version upgrades.
CAP theorem	Default Postgres = CP. Async replication = leans toward AP. Quorum `remote_apply` = best practical balance.

Module 11 steps back from Postgres-specific mechanics to answer the question engineers argue about constantly but rarely reason through rigorously: when is Postgres the right database, and when is it not?

Next: Module 11 — SQL vs NoSQL: An Engineer's Framework, Not a Marketing Debate →

PreviousModule 9: The RETURNING Clause Evolved: OLD/NEW Aliases and Eliminating Race Conditions Next Module 11: SQL vs NoSQL: An Engineer's Framework, Not a Marketing Debate