Module 15·23 min read

How Postgres extensions replace entire categories of specialized databases — without leaving ACID behind.

Module 15 — The Extensions Ecosystem: pg_cron, TimescaleDB, Citus, and pgvector

What this module covers: PostgreSQL's extension system is its most underappreciated feature. Extensions add capabilities that match or exceed dedicated NoSQL databases — without leaving the ACID guarantees, SQL interface, and operational tooling you already have. This module covers the four extensions that most frequently eliminate the need for a separate database: pg_cron for background job scheduling, TimescaleDB for time-series at scale, Citus for horizontal sharding, and pgvector for AI/embedding workloads.

Why Extensions Matter

In Module 11, we compared Postgres to MongoDB, Cassandra, and ClickHouse. Those comparisons assumed vanilla Postgres. With extensions, the picture changes:

Need time-series with columnar compression? → TimescaleDB makes Postgres competitive with InfluxDB and ClickHouse
Need horizontal write scaling? → Citus distributes Postgres across many nodes
Need vector similarity search for AI? → pgvector makes Postgres competitive with Pinecone and Weaviate
Need background job scheduling? → pg_cron replaces external cron + application-layer job tables

Each extension adds specialized capabilities while keeping the full Postgres stack: MVCC, WAL, ACID, SQL, pg_stat_*, your existing monitoring, your existing connection pooler, your existing backup tooling.

`pg_cron`: Background Job Scheduling Inside Postgres

pg_cron runs SQL or stored procedures on a cron schedule, directly inside the database process. No external scheduler, no separate worker service, no application-layer job table.

Installation and Setup

sql
-- Add to shared_preload_libraries (requires restart)
-- postgresql.conf:
-- shared_preload_libraries = 'pg_cron'
-- cron.database_name = 'mydb'

CREATE EXTENSION pg_cron;

Scheduling Jobs

sql
-- Run ANALYZE on the transactions table every night at 2 AM
SELECT cron.schedule(
  'nightly-analyze-transactions',
  '0 2 * * *',
  'ANALYZE transactions'
);

-- Run partition maintenance every hour
SELECT cron.schedule(
  'partman-hourly',
  '0 * * * *',
  $$SELECT partman.run_maintenance_proc()$$
);

-- Clean up expired sessions every 5 minutes
SELECT cron.schedule(
  'cleanup-expired-sessions',
  '*/5 * * * *',
  $$DELETE FROM sessions WHERE expires_at < now()$$
);

-- Run a stored procedure weekly on Sunday at 3 AM
SELECT cron.schedule(
  'weekly-aggregation',
  '0 3 * * 0',
  $$CALL compute_weekly_stats()$$
);

Managing Jobs

sql
-- List all scheduled jobs
SELECT jobid, jobname, schedule, command, active
FROM cron.job
ORDER BY jobid;

-- View recent job run history
SELECT
  jobid,
  job_pid,
  database,
  username,
  command,
  status,
  start_time,
  end_time,
  return_message
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 20;

-- Disable a job without removing it
UPDATE cron.job SET active = false WHERE jobname = 'nightly-analyze-transactions';

-- Remove a job
SELECT cron.unschedule('nightly-analyze-transactions');

When `pg_cron` Is the Right Tool

pg_cron is ideal for:

Database maintenance tasks (ANALYZE, VACUUM, partition creation)
Data retention cleanup (delete rows older than N days)
Periodic aggregation into summary tables
Refreshing materialized views on a schedule

It is not suitable for:

Long-running jobs that need external resources (API calls, file I/O)
Jobs that need distributed coordination across multiple Postgres instances
Jobs with complex dependency graphs (use Airflow, Prefect, or similar)

TimescaleDB: Time-Series at Postgres Scale

TimescaleDB is an open-source extension that adds automatic time-based partitioning (hypertables), columnar compression, and time-series-specific query functions on top of Postgres.

Hypertables

A hypertable is a Postgres table with automatic time-based partitioning managed by TimescaleDB. Data is partitioned into "chunks" automatically — no manual CREATE TABLE partition ... required.

sql
CREATE EXTENSION timescaledb;

-- Convert a regular table to a hypertable
CREATE TABLE metrics (
  time        TIMESTAMPTZ NOT NULL,
  device_id   BIGINT NOT NULL,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION
);

-- Convert to hypertable, partitioned by 'time' with 7-day chunks
SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '7 days');

-- Inserts now automatically route to the correct chunk
INSERT INTO metrics VALUES (now(), 1001, 23.5, 68.2);

The hypertable looks and behaves like a regular Postgres table. All standard SQL works. TimescaleDB handles partitioning transparently.

Columnar Compression

TimescaleDB's native columnar compression is the feature that makes it genuinely competitive with InfluxDB and ClickHouse for time-series workloads.

sql
-- Enable compression on chunks older than 7 days
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id',  -- group by device for compression
  timescaledb.compress_orderby = 'time DESC'
);

-- Automate compression with a policy
SELECT add_compression_policy('metrics', INTERVAL '7 days');

-- Check compression ratios
SELECT
  chunk_name,
  pg_size_pretty(before_compression_total_bytes) AS before,
  pg_size_pretty(after_compression_total_bytes) AS after,
  ROUND(100.0 * (1 - after_compression_total_bytes::numeric /
    NULLIF(before_compression_total_bytes, 0)), 1) AS compression_pct
FROM chunk_compression_stats('metrics')
ORDER BY chunk_name;

Typical compression ratios for time-series data: 10–40x. A 100GB raw metrics table becomes 3–10GB compressed. Compressed chunks are read using a columnar scan — only the requested columns are decompressed, dramatically reducing I/O for aggregation queries.

Continuous Aggregates

Continuous aggregates are materialized views that update automatically as new data arrives:

sql
-- Create a continuous aggregate: hourly averages per device
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  device_id,
  AVG(temperature) AS avg_temp,
  AVG(humidity) AS avg_humidity,
  MAX(temperature) AS max_temp,
  MIN(temperature) AS min_temp
FROM metrics
GROUP BY bucket, device_id;

-- Refresh policy: update the aggregate every 30 minutes for recent data
SELECT add_continuous_aggregate_policy(
  'metrics_hourly',
  start_offset => INTERVAL '3 hours',
  end_offset   => INTERVAL '1 hour',
  schedule_interval => INTERVAL '30 minutes'
);

-- Query the aggregate (much faster than raw table for dashboards)
SELECT bucket, device_id, avg_temp
FROM metrics_hourly
WHERE bucket > now() - INTERVAL '24 hours'
ORDER BY bucket DESC;

Data Retention

sql
-- Automatically drop chunks older than 90 days
SELECT add_retention_policy('metrics', INTERVAL '90 days');

TimescaleDB vs Plain Postgres Partitioning

	Plain Postgres Partitioning	TimescaleDB
Partition creation	Manual or pg_partman	Automatic
Columnar compression	No	Yes (10–40x)
Continuous aggregates	Manual with triggers	Native
Retention policies	Manual pg_cron job	Native
SQL compatibility	Full	Full
Extension required	No	Yes

TimescaleDB is the right choice when: you have high-volume time-series data (metrics, events, IoT) that needs compression and fast aggregate queries, and you want to stay on Postgres rather than migrating to InfluxDB or ClickHouse.

Citus: Horizontal Sharding for Postgres

Citus distributes a Postgres database across multiple nodes — sharding both data and query execution. It is built by Microsoft and available as an open-source extension or managed service (Azure Cosmos DB for PostgreSQL).

The Citus Architecture

Coordinator node:
  - Accepts all SQL queries
  - Parses and plans queries
  - Routes data operations to the correct worker(s)
  - Aggregates results from workers

Worker nodes (N nodes):
  - Store shards of distributed tables
  - Execute fragments of queries pushed down from coordinator
  - Each shard is a regular Postgres table on the worker

Distribution Key Selection

sql
-- On the coordinator: distribute transactions by sender address
SELECT create_distributed_table('transactions', 'sender');

-- Now:
-- Rows with the same sender are colocated on the same shard
-- Queries filtering by sender go to exactly one worker
-- Aggregations by sender are pushed down to workers, results aggregated on coordinator

The distribution key is the most critical decision. Choose it so:

Frequently joined tables are colocated — if transactions and blocks are joined frequently, distribute both by a compatible key
Hot queries filter on it — WHERE sender = '0xabc...' goes to one shard, not all shards
Data is evenly distributed — high-cardinality columns like UUIDs or blockchain addresses

Reference Tables

Some tables are small and frequently joined — distribute them to all workers:

sql
-- Reference table: replicated to every worker node
SELECT create_reference_table('status_codes');
-- Every worker has a full copy — joins with reference tables are local

When Citus Is the Right Answer

Citus is appropriate when:

Write throughput exceeds what a single Postgres instance handles (~50,000+ TPS)
Data volume exceeds what a single server stores efficiently (10+ TB)
Most queries filter on the distribution key — if queries frequently scan all shards, sharding adds overhead without benefit

For most applications below these thresholds, a single Postgres primary with read replicas is simpler and sufficient.

pgvector: Vector Embeddings for AI Workloads

pgvector adds a vector data type and vector similarity search indexes to Postgres. It is the extension that makes Postgres competitive with dedicated vector databases like Pinecone, Weaviate, and Chroma for AI/RAG (Retrieval-Augmented Generation) workloads.

What Vector Search Is

An embedding is a fixed-size array of floating-point numbers that represents the semantic meaning of text, an image, or any other content. Embeddings with similar meaning have small vector distances. Searching for the most similar embeddings is the core operation in:

RAG (Retrieval-Augmented Generation): find documents semantically similar to a query, feed them to an LLM as context
Semantic search: return results by meaning, not keyword matching
Recommendation systems: find items similar to what a user engaged with
Duplicate detection: find near-duplicate content by vector proximity

Installation and Schema

sql
CREATE EXTENSION vector;

-- Store 1536-dimensional embeddings (OpenAI ada-002 output size)
CREATE TABLE documents (
  id          BIGSERIAL PRIMARY KEY,
  content     TEXT NOT NULL,
  embedding   vector(1536),
  metadata    JSONB,
  created_at  TIMESTAMPTZ DEFAULT now()
);

Inserting Embeddings

python
import openai
import psycopg2

# Generate embedding from OpenAI
response = openai.embeddings.create(
    input="PostgreSQL internals for production engineers",
    model="text-embedding-ada-002"
)
embedding = response.data[0].embedding  # list of 1536 floats

# Insert into Postgres
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("PostgreSQL internals...", embedding)
)

Vector Similarity Search: Exact vs Approximate

Exact search (no index — scans all vectors):

sql
-- Find the 5 most similar documents to a query embedding
SELECT id, content,
  embedding <-> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
ORDER BY embedding <-> '[0.1, 0.2, ...]'::vector
LIMIT 5;

Operators:

<-> — L2 (Euclidean) distance
<#> — negative inner product (for dot product similarity)
<=> — cosine distance (most common for text embeddings)

Exact search is O(N) — every vector is compared. For up to ~100K vectors, this is fast enough. Above that, you need an approximate nearest-neighbor (ANN) index.

IVFFlat Index

IVFFlat (Inverted File Flat) divides vectors into lists clusters. A query searches only the nearest N clusters instead of all vectors.

sql
-- Build IVFFlat index: 100 lists for ~1M vectors
-- Rule of thumb: lists ≈ sqrt(row_count)
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Query with probe count: search nearest probes clusters
-- Higher probes = better recall, slower query
SET ivfflat.probes = 10;  -- search 10 of 100 clusters

SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

Recall vs speed trade-off:

probes = 1: fastest, lowest recall (~70% of exact results)
probes = 10: balanced (~90% recall)
probes = lists: exact search (same as no index)

Build IVFFlat only after the table has data — the index quality depends on the data distribution at build time. Rebuild if data distribution changes significantly.

HNSW Index (Recommended)

HNSW (Hierarchical Navigable Small World) builds a graph-based index with better recall/speed trade-offs than IVFFlat for most workloads.

sql
-- Build HNSW index
-- m: number of connections per node (higher = better recall, larger index)
-- ef_construction: search width during build (higher = better index, slower build)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query: ef_search controls recall at query time (higher = better recall, slower)
SET hnsw.ef_search = 100;

SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

HNSW vs IVFFlat:

	IVFFlat	HNSW
Build time	Fast	Slower (especially for large datasets)
Memory	Lower	Higher (graph stored in memory)
Recall at same speed	Lower	Higher
Incremental inserts	Requires rebuild for best quality	Good — inserts update the graph
Best for	>1M vectors, memory-constrained	Most workloads, especially with inserts

Hybrid Search: Vector + SQL Filters

The real power of pgvector in Postgres: combine semantic vector search with SQL predicates in one query:

sql
-- Find the 5 most similar documents to a query embedding
-- but only from the 'technical' category published in 2026
SELECT id, content, embedding <=> $1 AS distance
FROM documents
WHERE category = 'technical'
  AND created_at >= '2026-01-01'
ORDER BY embedding <=> $1
LIMIT 5;

A dedicated vector database requires an additional filter step after the ANN search — often requiring you to fetch more candidates than needed and filter in the application. Postgres does this in a single query with the planner's cost model deciding the optimal execution order.

pgvector vs Dedicated Vector Databases

	pgvector	Pinecone / Weaviate
ACID transactions	Yes	No
SQL filters in same query	Native	Post-processing
Operational complexity	Same as Postgres	New system to operate
Metadata storage	JSONB + columns	Separate metadata store
Scale	~100M vectors (1536-dim, 128GB RAM)	Unlimited (distributed)
Cost	Postgres hosting cost	Per-vector pricing

For most RAG and semantic search applications below 100M vectors: pgvector in Postgres eliminates the need for a dedicated vector database entirely. Your embeddings, metadata, and business data live in one system with consistent transactions.

Summary

Extension	Replaces	When to Use
`pg_cron`	External cron + job tables	DB maintenance, data cleanup, periodic aggregation
TimescaleDB	InfluxDB, ClickHouse (time-series)	High-volume time-series, IoT metrics, automated compression
Citus	Manual sharding, Cassandra (write scale)	>50K TPS write throughput or >10TB single-table datasets
`pgvector`	Pinecone, Weaviate, Chroma	RAG, semantic search, recommendation systems, <100M vectors

The principle: before adding a new database to your infrastructure, check if a Postgres extension solves the problem. The operational cost of one well-understood Postgres cluster is almost always lower than operating two specialized systems.

PreviousModule 14: Advanced Partitioning: Pruning, Maintenance, and pg_partman Next Module 16: Connection Pooling Failure Modes: PgBouncer, Serverless, and the Edge