Every schema decision on day one compounds at 100x data volume. This module covers the decisions that matter most.
Module 7 — Schema Design at Scale: Decisions That Cannot Be Undone
What this module covers: Schema decisions made on day one compound at 100x data volume. Choosing the wrong primary key type, the wrong partition strategy, or the wrong normalization level creates constraints that are expensive or impossible to reverse without downtime. This module covers the decisions that matter most — not as rules to follow, but as trade-offs to understand so you can derive the right answer for your specific workload.
Why Schema Decisions Are Different From Other Decisions
Most engineering decisions are reversible. You can refactor code, swap a library, change an API response shape, rewrite a service. The cost is development time.
Schema decisions in a production database are different. Once a table has 500 million rows:
- Changing a column type requires a full table rewrite (hours of downtime, or a complex online migration)
- Adding a
NOT NULLcolumn requires a default or a multi-step migration to avoid locking - Changing a primary key type (
INT→BIGINT) requires rewriting the table and every foreign key table - Repartitioning requires rebuilding the entire table and its indexes
- Removing normalization (collapsing two tables) requires a data migration and application changes across every service that touches both tables
The cost of reversing a bad schema decision grows linearly with data volume and quadratically with the number of services that depend on the schema. At 10 billion rows, some decisions cannot be reversed at all without extended downtime.
This is why schema design deserves disproportionate thought at the start.
Data Type Choices: The Foundation
Integer Primary Keys: INT vs BIGINT
INT (32-bit signed) holds values up to 2,147,483,647 (~2.1 billion). BIGINT (64-bit signed) holds up to 9,223,372,036,854,775,807 (~9.2 quintillion).
The INT vs BIGINT decision seems trivial. It has caused production outages at multiple large companies.
At 1,000 inserts/second, an INT primary key exhausts in:
2,147,483,647 / 1,000 / 86,400 / 365 ≈ 68 years
At 10,000 inserts/second:
2,147,483,647 / 10,000 / 86,400 / 365 ≈ 6.8 years
At 100,000 inserts/second (realistic for a blockchain indexer):
2,147,483,647 / 100,000 / 86,400 / 365 ≈ 249 days
When an INT sequence exhausts, inserts fail with:
ERROR: integer out of range
Migrating from INT to BIGINT on a 500M-row table with 8 foreign key tables is a multi-day operation requiring careful sequencing of table rewrites and application deployments.
Rule: always use BIGINT for primary keys. The 4-byte difference per row is irrelevant at any scale. The migration cost if you're wrong is catastrophic.
sql-- Correct id BIGSERIAL PRIMARY KEY -- This will haunt you at scale id SERIAL PRIMARY KEY -- SERIAL = INT, not BIGINT
UUID vs BIGSERIAL: The Real Trade-off
UUIDs are attractive for distributed systems: generated client-side, globally unique without coordination, naturally partition-friendly. The cost is real and often underestimated.
UUIDs as primary keys on a B-tree index:
A UUID v4 is random. Inserting random UUIDs into a B-tree causes random page splits (Module 5) — every insert goes to a random leaf page instead of the rightmost page. This:
- Defeats the sequential-insert optimization in B-tree
- Causes all B-tree leaf pages to sit at ~50% fill factor
- Makes the primary key index 2x larger than it needs to be
- Increases write I/O: every insert touches a random cached page rather than the current append page
For a 500M-row table, the UUID primary key index might be 30GB instead of the 15GB it would be with BIGSERIAL. That's 15GB of extra cache pressure and 15GB of extra WAL on every full page write cycle.
UUID v7 (ordered): UUID v7 uses a time-based prefix, making them monotonically increasing. This restores the sequential-insert optimization. If you need UUIDs for distribution, use v7.
sql-- PostgreSQL 17+ has built-in UUID v7 SELECT gen_random_uuid(); -- v4 (random) -- For v7 in older versions, use the pg_uuidv7 extension
The practical guide:
| Use Case | Recommended Key |
|---|---|
| Single-database service, high write throughput | BIGSERIAL |
| Distributed system, client-generated IDs | UUID v7 |
| Multi-tenant where tenants generate IDs | UUID v7 |
| Public API (don't expose sequential IDs) | UUID v4 or v7 |
| Pure internal table, no external exposure | BIGSERIAL |
Numeric Types: Precision vs Storage
sql-- For financial values: NEVER use FLOAT or DOUBLE PRECISION -- Floating point cannot represent all decimal values exactly SELECT 0.1 + 0.2 = 0.3; -- false in floating point -- Use NUMERIC for exact decimal arithmetic amount NUMERIC(38, 8) -- up to 38 digits, 8 decimal places -- NUMERIC storage cost: variable, proportional to precision -- NUMERIC(38, 8) stores up to 20 bytes per value -- For high-throughput financial tables, this adds up -- If you need exact integers (amounts in cents/satoshis): amount_satoshis BIGINT -- 8 bytes, fixed, fast arithmetic
For a blockchain indexer storing transaction amounts in the native token's smallest unit (satoshis, wei, etc.), BIGINT is almost always better than NUMERIC — it's smaller, faster, and the conversion to display units happens in the application layer.
Text Storage: VARCHAR vs TEXT
In Postgres, VARCHAR(n) and TEXT are stored identically. VARCHAR(n) adds a constraint check on insert/update. TEXT has no length limit.
sql-- These are equivalent in storage: name VARCHAR(255) name TEXT -- VARCHAR(n) adds constraint checking overhead on every insert/update -- Use TEXT unless you genuinely want to enforce a maximum length
CHAR(n) is different — it pads values to length n with spaces. Almost never what you want. Avoid it.
BYTEA vs TEXT for Binary Data
Hashes, keys, and binary blobs should be stored as BYTEA, not hex strings in TEXT.
sql-- Wrong: stores '0x' + 64 hex chars = 66 bytes per hash hash TEXT -- e.g., '0xabcdef...' -- Correct: stores 32 raw bytes = 32 bytes per hash hash BYTEA -- 2x storage efficiency, exact binary comparison -- Comparison is also more reliable: -- TEXT comparison is locale-sensitive for some operators -- BYTEA comparison is always byte-for-byte
For a blockchain indexer with 100M transaction hashes, switching from TEXT to BYTEA saves 3.4GB in the column alone — plus proportional reduction in index size.
Normalization vs Denormalization: The Production Trade-off
When Normalization Wins
Third Normal Form (3NF) — every non-key column depends only on the primary key — produces:
- No data duplication (each fact stored once)
- Referential integrity enforced by the database
- Updates propagate automatically (change one row, reflected everywhere)
- Smaller tables (more cache-efficient for targeted queries)
Normalization is correct for:
- Frequently updated reference data (prices, statuses, user profiles)
- Data where consistency matters more than read speed
- OLTP workloads with many small, targeted queries
When Denormalization Wins
Denormalization — intentionally duplicating data to avoid joins — wins when:
- Join cost exceeds duplication cost
- The joined data is rarely or never updated
- The query is on a hot path where every millisecond matters
sql-- Normalized: transactions join to blocks for block metadata SELECT t.hash, t.amount, b.validator, b.timestamp FROM transactions t JOIN blocks b ON b.height = t.block_height WHERE t.id = 12345; -- Denormalized: store validator and block_timestamp directly in transactions SELECT hash, amount, block_validator, block_timestamp FROM transactions WHERE id = 12345; -- Eliminates the join entirely
The cost of denormalization: if block_validator changes (rare in blockchain, common in other domains), you must update every transaction row for that block. If the data never changes after insert (append-only blockchain data), the cost is zero — denormalization is a pure win.
The practical rule: denormalize data that is:
- Written once, read many times
- Used together so frequently that joins are a consistent bottleneck
- Not subject to updates that would require cascading changes
Partial Denormalization: Materialized Views
When you need join performance without full denormalization, materialized views store the join result:
sqlCREATE MATERIALIZED VIEW transaction_summary AS SELECT t.id, t.hash, t.sender, t.amount, t.status, b.validator AS block_validator, b.timestamp AS block_timestamp FROM transactions t JOIN blocks b ON b.height = t.block_height WITH DATA; CREATE INDEX ON transaction_summary (id); CREATE INDEX ON transaction_summary (sender, block_timestamp); -- Refresh on schedule or after batch loads: REFRESH MATERIALIZED VIEW CONCURRENTLY transaction_summary;
CONCURRENTLY allows reads during refresh (requires a unique index). Non-concurrent refresh takes an exclusive lock.
JSONB: When It's Right and When It's a Trap
The Legitimate Use Cases for JSONB
JSONB is genuinely useful for:
1. Schema-flexible attributes — when different rows have different sets of optional attributes and you cannot enumerate them all at schema design time:
sql-- Blockchain events have different structures per event type payload JSONB -- {'type': 'Transfer', 'from': '0x...', 'to': '0x...', 'value': 100} -- {'type': 'Approval', 'owner': '0x...', 'spender': '0x...'}
2. External data ingestion — when you receive JSON from an external source and want to store it as-is without parsing every field:
sqlraw_webhook JSONB -- store the full webhook payload for replay/debugging
3. Nested arrays/objects that don't fit the relational model cleanly:
sqlevents JSONB -- array of sub-events within a transaction
When JSONB Becomes a Trap
1. JSONB as a schema escape hatch for relational data:
sql-- Wrong: using JSONB to avoid defining proper columns user_data JSONB -- {'name': 'John', 'email': 'john@...', 'age': 30} -- Correct: these are stable, always-present attributes name TEXT NOT NULL, email TEXT NOT NULL, age INT
When you put relational data in JSONB, you lose:
- Column-level statistics (planner can't estimate selectivity on JSONB fields)
- CHECK constraints on individual fields
- NOT NULL guarantees per field
- Type safety (JSONB stores everything as JSON types, not Postgres types)
- Efficient index access without expression indexes
2. JSONB for frequently queried fields:
sql-- Querying this requires a GIN index or sequential scan WHERE payload->>'status' = 'confirmed' -- This is always better: WHERE status = 'confirmed' -- direct column, B-tree index, planner statistics
3. JSONB for large blobs that inflate page size:
JSONB columns over ~2KB use TOAST (Module 1). TOAST reads require additional I/O and prevent JSONB fields from being covered in regular indexes. Large JSONB documents on a hot-query table inflate the table's page count and reduce cache effectiveness.
The rule: JSONB for truly schema-flexible, append-heavy data. Columns for everything you query, filter, or join on.
Table Partitioning
Partitioning divides a logical table into physical sub-tables (partitions) based on a column value. The correct partitioning strategy depends entirely on your access pattern.
Range Partitioning
Best for time-series data or sequential numeric keys where queries target ranges.
sqlCREATE TABLE transactions ( id BIGSERIAL, hash BYTEA NOT NULL, block_height BIGINT NOT NULL, timestamp TIMESTAMPTZ NOT NULL, amount NUMERIC(38, 8) ) PARTITION BY RANGE (block_height); CREATE TABLE transactions_0_to_5m PARTITION OF transactions FOR VALUES FROM (0) TO (5000000); CREATE TABLE transactions_5m_to_10m PARTITION OF transactions FOR VALUES FROM (5000000) TO (10000000); -- Create indexes on each partition (or use inherited indexes) CREATE INDEX ON transactions_0_to_5m (block_height); CREATE INDEX ON transactions_5m_to_10m (block_height);
Partition pruning: queries with a predicate on block_height only scan the relevant partition(s). A query for block_height = 7500000 only touches transactions_5m_to_10m.
The maintenance operation you must plan for: adding new partitions. If you partition by block height, you must create a new partition before data arrives for that range. Automate partition creation:
sql-- Create next partition proactively (run as a scheduled job) CREATE TABLE transactions_10m_to_15m PARTITION OF transactions FOR VALUES FROM (10000000) TO (15000000);
Hash Partitioning
Best for distributing write load evenly across partitions without a natural range key.
sqlCREATE TABLE sessions ( id UUID DEFAULT gen_random_uuid(), user_id BIGINT NOT NULL, data JSONB ) PARTITION BY HASH (user_id); CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0); CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1); CREATE TABLE sessions_2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2); CREATE TABLE sessions_3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Hash partitioning distributes rows evenly — each partition has roughly 25% of the data. It does not help with query pruning (a query for a specific user_id still goes to exactly one partition, but a range query for all users goes to all partitions).
List Partitioning
Best for categorical partitioning where you want to colocate data by a discrete value.
sqlCREATE TABLE events ( id BIGSERIAL, region TEXT NOT NULL, event_type TEXT NOT NULL, payload JSONB ) PARTITION BY LIST (region); CREATE TABLE events_us PARTITION OF events FOR VALUES IN ('us-east', 'us-west'); CREATE TABLE events_eu PARTITION OF events FOR VALUES IN ('eu-west', 'eu-central'); CREATE TABLE events_asia PARTITION OF events FOR VALUES IN ('ap-southeast', 'ap-northeast');
When Partitioning Hurts
1. Cross-partition queries become more expensive:
sql-- Without predicate on partition key = scans ALL partitions SELECT count(*) FROM transactions WHERE status = 'pending'; -- Append node: scans every partition sequentially
If your hot queries don't filter on the partition key, partitioning adds overhead (multiple partition plans) without pruning benefit.
2. Cross-partition joins are expensive:
sqlSELECT t.*, b.validator FROM transactions t -- partitioned by block_height JOIN blocks b ON b.height = t.block_height WHERE t.sender = '0xabc...'; -- not the partition key -- Must scan all transaction partitions for this sender
3. Partition count affects planning time:
Each partition is a separate table to the planner. A table with 500 partitions means the planner considers 500 relations when generating the plan. Planning time for complex queries on heavily-partitioned tables can become a significant overhead.
Keep partition count under 1,000. For time-series data, monthly or weekly partitions are usually right — daily partitions for a 5-year dataset creates 1,825 partitions.
4. Global indexes don't exist:
Postgres does not support global indexes (an index spanning all partitions). If you need a UNIQUE constraint on a non-partition-key column, you must either:
- Include the partition key in the unique constraint
- Enforce uniqueness at the application layer
- Use a separate lookup table
sql-- This works: hash is unique within each partition ALTER TABLE transactions ADD CONSTRAINT transactions_hash_unique UNIQUE (block_height, hash); -- block_height is the partition key, so this is enforced per-partition -- This does NOT work: can't guarantee global uniqueness across partitions -- ALTER TABLE transactions ADD CONSTRAINT transactions_hash_unique UNIQUE (hash);
Constraint Design
Constraints are the database's way of guaranteeing data integrity. They are not optional — they are the difference between a database you can trust and one you have to validate in application code.
NOT NULL
The most basic constraint. Use it on every column that should never be NULL.
sql-- Wrong: allows NULL sender, creating ambiguity sender VARCHAR(66) -- Correct: sender is always required sender VARCHAR(66) NOT NULL
NOT NULL constraints have zero runtime cost (NULL check is done on insert/update anyway). There is no reason not to use them.
CHECK Constraints
Enforce domain rules at the database level:
sqlALTER TABLE transactions ADD CONSTRAINT amount_positive CHECK (amount >= 0); ALTER TABLE transactions ADD CONSTRAINT status_valid CHECK (status IN ('pending', 'confirmed', 'failed')); -- Complex constraint: amount must be 0 for 'failed' transactions ALTER TABLE transactions ADD CONSTRAINT failed_zero_amount CHECK (status != 'failed' OR amount = 0);
Adding a CHECK constraint to an existing table does a full table scan to validate existing rows. On a 500M-row table this takes minutes. To add without locking:
sql-- Step 1: add constraint as NOT VALID (skips existing rows, enforces future inserts) ALTER TABLE transactions ADD CONSTRAINT amount_positive CHECK (amount >= 0) NOT VALID; -- Step 2: validate existing rows (takes ShareUpdateExclusiveLock — reads allowed) ALTER TABLE transactions VALIDATE CONSTRAINT amount_positive;
Foreign Keys: The Locking Implications
Foreign keys enforce referential integrity but have a subtle locking implication:
When you insert a row into a table with a foreign key, Postgres acquires a ShareLock on the referenced row. When you delete from the referenced table, Postgres acquires a lock that blocks concurrent FK-checking inserts.
On a high-throughput table with a foreign key to a frequently-modified parent, this can create lock contention.
sql-- High-throughput inserts into transactions referencing blocks -- Each insert acquires ShareLock on the referenced block row -- Concurrent block updates cause contention
Options for high-throughput FK tables:
- Enforce FK at the application layer, remove the database constraint
- Keep the FK but ensure the parent table is updated infrequently
- Use
DEFERRABLE INITIALLY DEFERREDto defer FK checks to commit time (reduces lock duration)
UNIQUE Constraints and Indexes
UNIQUE constraints automatically create a unique B-tree index. Adding a unique constraint to a large table creates the index (which takes time and I/O) and validates all existing rows.
Online approach:
sql-- Step 1: create the unique index concurrently (no blocking) CREATE UNIQUE INDEX CONCURRENTLY idx_transactions_hash_unique ON transactions (hash); -- Step 2: add the constraint using the existing index (instant) ALTER TABLE transactions ADD CONSTRAINT transactions_hash_unique UNIQUE USING INDEX idx_transactions_hash_unique;
Online Schema Migrations
Schema changes on live production tables require careful sequencing to avoid blocking reads and writes.
Adding a Column
sql-- Adding a nullable column: instant (no table rewrite) ALTER TABLE transactions ADD COLUMN processed_at TIMESTAMPTZ; -- Adding a NOT NULL column with a default: instant in PG 11+ (stores default in catalog) ALTER TABLE transactions ADD COLUMN version INT NOT NULL DEFAULT 1; -- PG 11+: the default is stored in pg_attribute, not written to every row -- PG 10 and earlier: full table rewrite required (hours on large tables) -- Adding a NOT NULL column without default: requires backfill first ALTER TABLE transactions ADD COLUMN region TEXT; -- step 1: add nullable UPDATE transactions SET region = 'us-east' WHERE region IS NULL; -- step 2: backfill ALTER TABLE transactions ALTER COLUMN region SET NOT NULL; -- step 3: add constraint
Renaming a Column (Zero-Downtime)
Renaming a column in a single migration breaks any code using the old name:
sql-- Wrong: instant outage if any connection uses old name ALTER TABLE transactions RENAME COLUMN sender TO from_address;
Zero-downtime rename:
sql-- Step 1: add new column ALTER TABLE transactions ADD COLUMN from_address VARCHAR(66); -- Step 2: keep new column in sync (trigger or application dual-write) CREATE TRIGGER sync_from_address BEFORE INSERT OR UPDATE ON transactions FOR EACH ROW EXECUTE FUNCTION sync_from_address_func(); -- copies sender → from_address -- Step 3: backfill old rows UPDATE transactions SET from_address = sender WHERE from_address IS NULL; -- Step 4: deploy application reading from new column -- Step 5: drop trigger and old column DROP TRIGGER sync_from_address ON transactions; ALTER TABLE transactions DROP COLUMN sender;
Changing a Column Type
Type changes require a full table rewrite unless the new type is binary compatible.
sql-- Requires full table rewrite (locks table): ALTER TABLE transactions ALTER COLUMN amount TYPE NUMERIC(40, 8); -- Zero-downtime approach: -- Step 1: add new column ALTER TABLE transactions ADD COLUMN amount_v2 NUMERIC(40, 8); -- Step 2: backfill with conversion UPDATE transactions SET amount_v2 = amount::NUMERIC(40, 8) WHERE amount_v2 IS NULL; -- Step 3: application writes to both columns -- Step 4: application reads from new column -- Step 5: drop old column ALTER TABLE transactions DROP COLUMN amount; ALTER TABLE transactions RENAME COLUMN amount_v2 TO amount;
The pg_repack Alternative
For complex migrations, pg_repack can rewrite a table online with minimal lock time:
bash# Rewrite table with new schema (specified via SQL file) pg_repack -h localhost -d mydb -t transactions --schema-only
The Blockchain Indexer Schema: A Case Study
The transactions table used throughout this course reflects real decisions made on a blockchain indexer:
sqlCREATE TABLE transactions ( -- BIGSERIAL not SERIAL — learned from a near-exhaustion incident id BIGSERIAL PRIMARY KEY, -- BYTEA not TEXT — 2x storage savings, exact comparison hash BYTEA NOT NULL UNIQUE, -- Indexed for range queries, partition key candidate block_height BIGINT NOT NULL, -- VARCHAR for blockchain addresses (fixed max length = 66 chars for EVM) sender VARCHAR(66) NOT NULL, recipient VARCHAR(66), -- nullable: coinbase transactions have no recipient -- NUMERIC for financial precision, not FLOAT amount NUMERIC(38, 8), -- CHECK constraint enforces domain rule status VARCHAR(20) NOT NULL DEFAULT 'pending' CHECK (status IN ('pending', 'confirmed', 'failed')), -- TIMESTAMPTZ not TIMESTAMP — always store timezone-aware timestamps timestamp TIMESTAMPTZ NOT NULL, -- JSONB for genuinely variable event data payload JSONB, created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); -- Partition by block_height for range query performance -- Each partition = 500,000 blocks ≈ ~7 days of Ethereum blocks ALTER TABLE transactions PARTITION BY RANGE (block_height);
Decisions made and why:
BIGSERIALoverSERIAL: learned from a near-INT-exhaustion incident at 2B rowsBYTEAforhash: 32 bytes vs 66 bytes TEXT, exact comparison semanticsVARCHAR(66)for addresses: enforces max length (EVM addresses are always ≤66 chars with0xprefix)NUMERIC(38, 8)for amount: exact decimal arithmetic for financial valuesTIMESTAMPTZnotTIMESTAMP: timezone-aware, consistent across deployments in different timezonesJSONBforpayload: event data is genuinely variable across transaction types- Partition by
block_height: API queries are always range queries on block height
Summary
| Decision | Default | When to Override |
|---|---|---|
| Integer primary key | BIGSERIAL | Never use SERIAL |
| UUID vs BIGSERIAL | BIGSERIAL | UUID v7 for distributed/external-facing IDs |
| Financial values | NUMERIC | BIGINT in smallest unit if possible |
| Text storage | TEXT | VARCHAR(n) only when you genuinely need length enforcement |
| Binary data | BYTEA | Never store binary as TEXT hex |
| JSONB | Columns for stable attributes | JSONB only for genuinely variable schemas |
| Partitioning | None | Add when table exceeds 50–100GB or bulk partition drops are needed |
| Partition strategy | Range (time/ID) | Hash for even distribution, List for categorical |
| NOT NULL | On all required columns | Nullable only when absence has meaning |
| FK constraints | Use them | Remove under extreme write contention |
| Schema migrations | Step-by-step online | Never ALTER TYPE on a live table without a plan |
Schema design sets the constraints that everything else operates within. Module 8 covers the biggest architectural change to Postgres in a decade — asynchronous I/O in PostgreSQL 18 — and what it means for everything you've learned so far.
Next: Module 8 — PostgreSQL 18: Asynchronous I/O and What It Changes →