Module 7·38 min read

Every schema decision on day one compounds at 100x data volume. This module covers the decisions that matter most.

Module 7 — Schema Design at Scale: Decisions That Cannot Be Undone

What this module covers: Schema decisions made on day one compound at 100x data volume. Choosing the wrong primary key type, the wrong partition strategy, or the wrong normalization level creates constraints that are expensive or impossible to reverse without downtime. This module covers the decisions that matter most — not as rules to follow, but as trade-offs to understand so you can derive the right answer for your specific workload.

Why Schema Decisions Are Different From Other Decisions

Most engineering decisions are reversible. You can refactor code, swap a library, change an API response shape, rewrite a service. The cost is development time.

Schema decisions in a production database are different. Once a table has 500 million rows:

Changing a column type requires a full table rewrite (hours of downtime, or a complex online migration)
Adding a NOT NULL column requires a default or a multi-step migration to avoid locking
Changing a primary key type (INT → BIGINT) requires rewriting the table and every foreign key table
Repartitioning requires rebuilding the entire table and its indexes
Removing normalization (collapsing two tables) requires a data migration and application changes across every service that touches both tables

The cost of reversing a bad schema decision grows linearly with data volume and quadratically with the number of services that depend on the schema. At 10 billion rows, some decisions cannot be reversed at all without extended downtime.

This is why schema design deserves disproportionate thought at the start.

Data Type Choices: The Foundation

Integer Primary Keys: INT vs BIGINT

INT (32-bit signed) holds values up to 2,147,483,647 (~2.1 billion). BIGINT (64-bit signed) holds up to 9,223,372,036,854,775,807 (~9.2 quintillion).

The INT vs BIGINT decision seems trivial. It has caused production outages at multiple large companies.

At 1,000 inserts/second, an INT primary key exhausts in:

2,147,483,647 / 1,000 / 86,400 / 365 ≈ 68 years

At 10,000 inserts/second:

2,147,483,647 / 10,000 / 86,400 / 365 ≈ 6.8 years

At 100,000 inserts/second (realistic for a blockchain indexer):

2,147,483,647 / 100,000 / 86,400 / 365 ≈ 249 days

When an INT sequence exhausts, inserts fail with:

ERROR: integer out of range

Migrating from INT to BIGINT on a 500M-row table with 8 foreign key tables is a multi-day operation requiring careful sequencing of table rewrites and application deployments.

Rule: always use BIGINT for primary keys. The 4-byte difference per row is irrelevant at any scale. The migration cost if you're wrong is catastrophic.

sql
-- Correct
id BIGSERIAL PRIMARY KEY

-- This will haunt you at scale
id SERIAL PRIMARY KEY  -- SERIAL = INT, not BIGINT

UUID vs BIGSERIAL: The Real Trade-off

UUIDs are attractive for distributed systems: generated client-side, globally unique without coordination, naturally partition-friendly. The cost is real and often underestimated.

UUIDs as primary keys on a B-tree index:

A UUID v4 is random. Inserting random UUIDs into a B-tree causes random page splits (Module 5) — every insert goes to a random leaf page instead of the rightmost page. This:

Defeats the sequential-insert optimization in B-tree
Causes all B-tree leaf pages to sit at ~50% fill factor
Makes the primary key index 2x larger than it needs to be
Increases write I/O: every insert touches a random cached page rather than the current append page

For a 500M-row table, the UUID primary key index might be 30GB instead of the 15GB it would be with BIGSERIAL. That's 15GB of extra cache pressure and 15GB of extra WAL on every full page write cycle.

UUID v7 (ordered): UUID v7 uses a time-based prefix, making them monotonically increasing. This restores the sequential-insert optimization. If you need UUIDs for distribution, use v7.

sql
-- PostgreSQL 17+ has built-in UUID v7
SELECT gen_random_uuid();  -- v4 (random)
-- For v7 in older versions, use the pg_uuidv7 extension

The practical guide:

Use Case	Recommended Key
Single-database service, high write throughput	`BIGSERIAL`
Distributed system, client-generated IDs	UUID v7
Multi-tenant where tenants generate IDs	UUID v7
Public API (don't expose sequential IDs)	UUID v4 or v7
Pure internal table, no external exposure	`BIGSERIAL`

Numeric Types: Precision vs Storage

sql
-- For financial values: NEVER use FLOAT or DOUBLE PRECISION
-- Floating point cannot represent all decimal values exactly
SELECT 0.1 + 0.2 = 0.3;  -- false in floating point

-- Use NUMERIC for exact decimal arithmetic
amount NUMERIC(38, 8)  -- up to 38 digits, 8 decimal places

-- NUMERIC storage cost: variable, proportional to precision
-- NUMERIC(38, 8) stores up to 20 bytes per value
-- For high-throughput financial tables, this adds up

-- If you need exact integers (amounts in cents/satoshis):
amount_satoshis BIGINT  -- 8 bytes, fixed, fast arithmetic

For a blockchain indexer storing transaction amounts in the native token's smallest unit (satoshis, wei, etc.), BIGINT is almost always better than NUMERIC — it's smaller, faster, and the conversion to display units happens in the application layer.

Text Storage: VARCHAR vs TEXT

In Postgres, VARCHAR(n) and TEXT are stored identically. VARCHAR(n) adds a constraint check on insert/update. TEXT has no length limit.

sql
-- These are equivalent in storage:
name VARCHAR(255)
name TEXT

-- VARCHAR(n) adds constraint checking overhead on every insert/update
-- Use TEXT unless you genuinely want to enforce a maximum length

CHAR(n) is different — it pads values to length n with spaces. Almost never what you want. Avoid it.

BYTEA vs TEXT for Binary Data

Hashes, keys, and binary blobs should be stored as BYTEA, not hex strings in TEXT.

sql
-- Wrong: stores '0x' + 64 hex chars = 66 bytes per hash
hash TEXT  -- e.g., '0xabcdef...'

-- Correct: stores 32 raw bytes = 32 bytes per hash
hash BYTEA  -- 2x storage efficiency, exact binary comparison

-- Comparison is also more reliable:
-- TEXT comparison is locale-sensitive for some operators
-- BYTEA comparison is always byte-for-byte

For a blockchain indexer with 100M transaction hashes, switching from TEXT to BYTEA saves 3.4GB in the column alone — plus proportional reduction in index size.

Normalization vs Denormalization: The Production Trade-off

When Normalization Wins

Third Normal Form (3NF) — every non-key column depends only on the primary key — produces:

No data duplication (each fact stored once)
Referential integrity enforced by the database
Updates propagate automatically (change one row, reflected everywhere)
Smaller tables (more cache-efficient for targeted queries)

Normalization is correct for:

Frequently updated reference data (prices, statuses, user profiles)
Data where consistency matters more than read speed
OLTP workloads with many small, targeted queries

When Denormalization Wins

Denormalization — intentionally duplicating data to avoid joins — wins when:

Join cost exceeds duplication cost
The joined data is rarely or never updated
The query is on a hot path where every millisecond matters

sql
-- Normalized: transactions join to blocks for block metadata
SELECT t.hash, t.amount, b.validator, b.timestamp
FROM transactions t
JOIN blocks b ON b.height = t.block_height
WHERE t.id = 12345;

-- Denormalized: store validator and block_timestamp directly in transactions
SELECT hash, amount, block_validator, block_timestamp
FROM transactions
WHERE id = 12345;
-- Eliminates the join entirely

The cost of denormalization: if block_validator changes (rare in blockchain, common in other domains), you must update every transaction row for that block. If the data never changes after insert (append-only blockchain data), the cost is zero — denormalization is a pure win.

The practical rule: denormalize data that is:

Written once, read many times
Used together so frequently that joins are a consistent bottleneck
Not subject to updates that would require cascading changes

Partial Denormalization: Materialized Views

When you need join performance without full denormalization, materialized views store the join result:

sql
CREATE MATERIALIZED VIEW transaction_summary AS
SELECT
  t.id,
  t.hash,
  t.sender,
  t.amount,
  t.status,
  b.validator AS block_validator,
  b.timestamp AS block_timestamp
FROM transactions t
JOIN blocks b ON b.height = t.block_height
WITH DATA;

CREATE INDEX ON transaction_summary (id);
CREATE INDEX ON transaction_summary (sender, block_timestamp);

-- Refresh on schedule or after batch loads:
REFRESH MATERIALIZED VIEW CONCURRENTLY transaction_summary;

CONCURRENTLY allows reads during refresh (requires a unique index). Non-concurrent refresh takes an exclusive lock.

JSONB: When It's Right and When It's a Trap

The Legitimate Use Cases for JSONB

JSONB is genuinely useful for:

1. Schema-flexible attributes — when different rows have different sets of optional attributes and you cannot enumerate them all at schema design time:

sql
-- Blockchain events have different structures per event type
payload JSONB  -- {'type': 'Transfer', 'from': '0x...', 'to': '0x...', 'value': 100}
               -- {'type': 'Approval', 'owner': '0x...', 'spender': '0x...'}

2. External data ingestion — when you receive JSON from an external source and want to store it as-is without parsing every field:

sql
raw_webhook JSONB  -- store the full webhook payload for replay/debugging

3. Nested arrays/objects that don't fit the relational model cleanly:

sql
events JSONB  -- array of sub-events within a transaction

When JSONB Becomes a Trap

1. JSONB as a schema escape hatch for relational data:

sql
-- Wrong: using JSONB to avoid defining proper columns
user_data JSONB  -- {'name': 'John', 'email': 'john@...', 'age': 30}

-- Correct: these are stable, always-present attributes
name TEXT NOT NULL,
email TEXT NOT NULL,
age INT

When you put relational data in JSONB, you lose:

Column-level statistics (planner can't estimate selectivity on JSONB fields)
CHECK constraints on individual fields
NOT NULL guarantees per field
Type safety (JSONB stores everything as JSON types, not Postgres types)
Efficient index access without expression indexes

2. JSONB for frequently queried fields:

sql
-- Querying this requires a GIN index or sequential scan
WHERE payload->>'status' = 'confirmed'

-- This is always better:
WHERE status = 'confirmed'  -- direct column, B-tree index, planner statistics

3. JSONB for large blobs that inflate page size:

JSONB columns over ~2KB use TOAST (Module 1). TOAST reads require additional I/O and prevent JSONB fields from being covered in regular indexes. Large JSONB documents on a hot-query table inflate the table's page count and reduce cache effectiveness.

The rule: JSONB for truly schema-flexible, append-heavy data. Columns for everything you query, filter, or join on.

Table Partitioning

Partitioning divides a logical table into physical sub-tables (partitions) based on a column value. The correct partitioning strategy depends entirely on your access pattern.

Range Partitioning

Best for time-series data or sequential numeric keys where queries target ranges.

sql
CREATE TABLE transactions (
  id          BIGSERIAL,
  hash        BYTEA NOT NULL,
  block_height BIGINT NOT NULL,
  timestamp   TIMESTAMPTZ NOT NULL,
  amount      NUMERIC(38, 8)
) PARTITION BY RANGE (block_height);

CREATE TABLE transactions_0_to_5m
  PARTITION OF transactions FOR VALUES FROM (0) TO (5000000);

CREATE TABLE transactions_5m_to_10m
  PARTITION OF transactions FOR VALUES FROM (5000000) TO (10000000);

-- Create indexes on each partition (or use inherited indexes)
CREATE INDEX ON transactions_0_to_5m (block_height);
CREATE INDEX ON transactions_5m_to_10m (block_height);

Partition pruning: queries with a predicate on block_height only scan the relevant partition(s). A query for block_height = 7500000 only touches transactions_5m_to_10m.

The maintenance operation you must plan for: adding new partitions. If you partition by block height, you must create a new partition before data arrives for that range. Automate partition creation:

sql
-- Create next partition proactively (run as a scheduled job)
CREATE TABLE transactions_10m_to_15m
  PARTITION OF transactions FOR VALUES FROM (10000000) TO (15000000);

Hash Partitioning

Best for distributing write load evenly across partitions without a natural range key.

sql
CREATE TABLE sessions (
  id      UUID DEFAULT gen_random_uuid(),
  user_id BIGINT NOT NULL,
  data    JSONB
) PARTITION BY HASH (user_id);

CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sessions_2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sessions_3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Hash partitioning distributes rows evenly — each partition has roughly 25% of the data. It does not help with query pruning (a query for a specific user_id still goes to exactly one partition, but a range query for all users goes to all partitions).

List Partitioning

Best for categorical partitioning where you want to colocate data by a discrete value.

sql
CREATE TABLE events (
  id        BIGSERIAL,
  region    TEXT NOT NULL,
  event_type TEXT NOT NULL,
  payload   JSONB
) PARTITION BY LIST (region);

CREATE TABLE events_us PARTITION OF events FOR VALUES IN ('us-east', 'us-west');
CREATE TABLE events_eu PARTITION OF events FOR VALUES IN ('eu-west', 'eu-central');
CREATE TABLE events_asia PARTITION OF events FOR VALUES IN ('ap-southeast', 'ap-northeast');

When Partitioning Hurts

1. Cross-partition queries become more expensive:

sql
-- Without predicate on partition key = scans ALL partitions
SELECT count(*) FROM transactions WHERE status = 'pending';
-- Append node: scans every partition sequentially

If your hot queries don't filter on the partition key, partitioning adds overhead (multiple partition plans) without pruning benefit.

2. Cross-partition joins are expensive:

sql
SELECT t.*, b.validator
FROM transactions t  -- partitioned by block_height
JOIN blocks b ON b.height = t.block_height
WHERE t.sender = '0xabc...';  -- not the partition key
-- Must scan all transaction partitions for this sender

3. Partition count affects planning time:

Each partition is a separate table to the planner. A table with 500 partitions means the planner considers 500 relations when generating the plan. Planning time for complex queries on heavily-partitioned tables can become a significant overhead.

Keep partition count under 1,000. For time-series data, monthly or weekly partitions are usually right — daily partitions for a 5-year dataset creates 1,825 partitions.

4. Global indexes don't exist:

Postgres does not support global indexes (an index spanning all partitions). If you need a UNIQUE constraint on a non-partition-key column, you must either:

Include the partition key in the unique constraint
Enforce uniqueness at the application layer
Use a separate lookup table

sql
-- This works: hash is unique within each partition
ALTER TABLE transactions ADD CONSTRAINT transactions_hash_unique UNIQUE (block_height, hash);
-- block_height is the partition key, so this is enforced per-partition

-- This does NOT work: can't guarantee global uniqueness across partitions
-- ALTER TABLE transactions ADD CONSTRAINT transactions_hash_unique UNIQUE (hash);

Constraint Design

Constraints are the database's way of guaranteeing data integrity. They are not optional — they are the difference between a database you can trust and one you have to validate in application code.

NOT NULL

The most basic constraint. Use it on every column that should never be NULL.

sql
-- Wrong: allows NULL sender, creating ambiguity
sender VARCHAR(66)

-- Correct: sender is always required
sender VARCHAR(66) NOT NULL

NOT NULL constraints have zero runtime cost (NULL check is done on insert/update anyway). There is no reason not to use them.

CHECK Constraints

Enforce domain rules at the database level:

sql
ALTER TABLE transactions ADD CONSTRAINT amount_positive
CHECK (amount >= 0);

ALTER TABLE transactions ADD CONSTRAINT status_valid
CHECK (status IN ('pending', 'confirmed', 'failed'));

-- Complex constraint: amount must be 0 for 'failed' transactions
ALTER TABLE transactions ADD CONSTRAINT failed_zero_amount
CHECK (status != 'failed' OR amount = 0);

Adding a CHECK constraint to an existing table does a full table scan to validate existing rows. On a 500M-row table this takes minutes. To add without locking:

sql
-- Step 1: add constraint as NOT VALID (skips existing rows, enforces future inserts)
ALTER TABLE transactions
ADD CONSTRAINT amount_positive CHECK (amount >= 0) NOT VALID;

-- Step 2: validate existing rows (takes ShareUpdateExclusiveLock — reads allowed)
ALTER TABLE transactions VALIDATE CONSTRAINT amount_positive;

Foreign Keys: The Locking Implications

Foreign keys enforce referential integrity but have a subtle locking implication:

When you insert a row into a table with a foreign key, Postgres acquires a ShareLock on the referenced row. When you delete from the referenced table, Postgres acquires a lock that blocks concurrent FK-checking inserts.

On a high-throughput table with a foreign key to a frequently-modified parent, this can create lock contention.

sql
-- High-throughput inserts into transactions referencing blocks
-- Each insert acquires ShareLock on the referenced block row
-- Concurrent block updates cause contention

Options for high-throughput FK tables:

Enforce FK at the application layer, remove the database constraint
Keep the FK but ensure the parent table is updated infrequently
Use DEFERRABLE INITIALLY DEFERRED to defer FK checks to commit time (reduces lock duration)

UNIQUE Constraints and Indexes

UNIQUE constraints automatically create a unique B-tree index. Adding a unique constraint to a large table creates the index (which takes time and I/O) and validates all existing rows.

Online approach:

sql
-- Step 1: create the unique index concurrently (no blocking)
CREATE UNIQUE INDEX CONCURRENTLY idx_transactions_hash_unique ON transactions (hash);

-- Step 2: add the constraint using the existing index (instant)
ALTER TABLE transactions
ADD CONSTRAINT transactions_hash_unique UNIQUE USING INDEX idx_transactions_hash_unique;

Online Schema Migrations

Schema changes on live production tables require careful sequencing to avoid blocking reads and writes.

Adding a Column

sql
-- Adding a nullable column: instant (no table rewrite)
ALTER TABLE transactions ADD COLUMN processed_at TIMESTAMPTZ;

-- Adding a NOT NULL column with a default: instant in PG 11+ (stores default in catalog)
ALTER TABLE transactions ADD COLUMN version INT NOT NULL DEFAULT 1;
-- PG 11+: the default is stored in pg_attribute, not written to every row
-- PG 10 and earlier: full table rewrite required (hours on large tables)

-- Adding a NOT NULL column without default: requires backfill first
ALTER TABLE transactions ADD COLUMN region TEXT;              -- step 1: add nullable
UPDATE transactions SET region = 'us-east' WHERE region IS NULL;  -- step 2: backfill
ALTER TABLE transactions ALTER COLUMN region SET NOT NULL;     -- step 3: add constraint

Renaming a Column (Zero-Downtime)

Renaming a column in a single migration breaks any code using the old name:

sql
-- Wrong: instant outage if any connection uses old name
ALTER TABLE transactions RENAME COLUMN sender TO from_address;

Zero-downtime rename:

sql
-- Step 1: add new column
ALTER TABLE transactions ADD COLUMN from_address VARCHAR(66);

-- Step 2: keep new column in sync (trigger or application dual-write)
CREATE TRIGGER sync_from_address
BEFORE INSERT OR UPDATE ON transactions
FOR EACH ROW EXECUTE FUNCTION
  sync_from_address_func();  -- copies sender → from_address

-- Step 3: backfill old rows
UPDATE transactions SET from_address = sender WHERE from_address IS NULL;

-- Step 4: deploy application reading from new column
-- Step 5: drop trigger and old column
DROP TRIGGER sync_from_address ON transactions;
ALTER TABLE transactions DROP COLUMN sender;

Changing a Column Type

Type changes require a full table rewrite unless the new type is binary compatible.

sql
-- Requires full table rewrite (locks table):
ALTER TABLE transactions ALTER COLUMN amount TYPE NUMERIC(40, 8);

-- Zero-downtime approach:
-- Step 1: add new column
ALTER TABLE transactions ADD COLUMN amount_v2 NUMERIC(40, 8);

-- Step 2: backfill with conversion
UPDATE transactions SET amount_v2 = amount::NUMERIC(40, 8) WHERE amount_v2 IS NULL;

-- Step 3: application writes to both columns
-- Step 4: application reads from new column
-- Step 5: drop old column
ALTER TABLE transactions DROP COLUMN amount;
ALTER TABLE transactions RENAME COLUMN amount_v2 TO amount;

The `pg_repack` Alternative

For complex migrations, pg_repack can rewrite a table online with minimal lock time:

bash
# Rewrite table with new schema (specified via SQL file)
pg_repack -h localhost -d mydb -t transactions --schema-only

The Blockchain Indexer Schema: A Case Study

The transactions table used throughout this course reflects real decisions made on a blockchain indexer:

sql
CREATE TABLE transactions (
  -- BIGSERIAL not SERIAL — learned from a near-exhaustion incident
  id            BIGSERIAL PRIMARY KEY,

  -- BYTEA not TEXT — 2x storage savings, exact comparison
  hash          BYTEA NOT NULL UNIQUE,

  -- Indexed for range queries, partition key candidate
  block_height  BIGINT NOT NULL,

  -- VARCHAR for blockchain addresses (fixed max length = 66 chars for EVM)
  sender        VARCHAR(66) NOT NULL,
  recipient     VARCHAR(66),  -- nullable: coinbase transactions have no recipient

  -- NUMERIC for financial precision, not FLOAT
  amount        NUMERIC(38, 8),

  -- CHECK constraint enforces domain rule
  status        VARCHAR(20) NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'confirmed', 'failed')),

  -- TIMESTAMPTZ not TIMESTAMP — always store timezone-aware timestamps
  timestamp     TIMESTAMPTZ NOT NULL,

  -- JSONB for genuinely variable event data
  payload       JSONB,

  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Partition by block_height for range query performance
-- Each partition = 500,000 blocks ≈ ~7 days of Ethereum blocks
ALTER TABLE transactions PARTITION BY RANGE (block_height);

Decisions made and why:

BIGSERIAL over SERIAL: learned from a near-INT-exhaustion incident at 2B rows
BYTEA for hash: 32 bytes vs 66 bytes TEXT, exact comparison semantics
VARCHAR(66) for addresses: enforces max length (EVM addresses are always ≤66 chars with 0x prefix)
NUMERIC(38, 8) for amount: exact decimal arithmetic for financial values
TIMESTAMPTZ not TIMESTAMP: timezone-aware, consistent across deployments in different timezones
JSONB for payload: event data is genuinely variable across transaction types
Partition by block_height: API queries are always range queries on block height

Summary

Decision	Default	When to Override
Integer primary key	`BIGSERIAL`	Never use `SERIAL`
UUID vs BIGSERIAL	`BIGSERIAL`	UUID v7 for distributed/external-facing IDs
Financial values	`NUMERIC`	`BIGINT` in smallest unit if possible
Text storage	`TEXT`	`VARCHAR(n)` only when you genuinely need length enforcement
Binary data	`BYTEA`	Never store binary as TEXT hex
JSONB	Columns for stable attributes	JSONB only for genuinely variable schemas
Partitioning	None	Add when table exceeds 50–100GB or bulk partition drops are needed
Partition strategy	Range (time/ID)	Hash for even distribution, List for categorical
NOT NULL	On all required columns	Nullable only when absence has meaning
FK constraints	Use them	Remove under extreme write contention
Schema migrations	Step-by-step online	Never ALTER TYPE on a live table without a plan

Schema design sets the constraints that everything else operates within. Module 8 covers the biggest architectural change to Postgres in a decade — asynchronous I/O in PostgreSQL 18 — and what it means for everything you've learned so far.

Next: Module 8 — PostgreSQL 18: Asynchronous I/O and What It Changes →

PreviousModule 6: Query Planning and Execution: How Postgres Decides What to Do With Your Query Next Module 8: PostgreSQL 18: Asynchronous I/O and What It Changes