Back/Module 15 The Extensions Ecosystem: pg_cron, TimescaleDB, Citus, and pgvector
Module 15·23 min read

How Postgres extensions replace entire categories of specialized databases — without leaving ACID behind.

Module 15 — The Extensions Ecosystem: pg_cron, TimescaleDB, Citus, and pgvector

What this module covers: PostgreSQL's extension system is its most underappreciated feature. Extensions add capabilities that match or exceed dedicated NoSQL databases — without leaving the ACID guarantees, SQL interface, and operational tooling you already have. This module covers the four extensions that most frequently eliminate the need for a separate database: pg_cron for background job scheduling, TimescaleDB for time-series at scale, Citus for horizontal sharding, and pgvector for AI/embedding workloads.


Why Extensions Matter

In Module 11, we compared Postgres to MongoDB, Cassandra, and ClickHouse. Those comparisons assumed vanilla Postgres. With extensions, the picture changes:

  • Need time-series with columnar compression? → TimescaleDB makes Postgres competitive with InfluxDB and ClickHouse
  • Need horizontal write scaling? → Citus distributes Postgres across many nodes
  • Need vector similarity search for AI? → pgvector makes Postgres competitive with Pinecone and Weaviate
  • Need background job scheduling? → pg_cron replaces external cron + application-layer job tables

Each extension adds specialized capabilities while keeping the full Postgres stack: MVCC, WAL, ACID, SQL, pg_stat_*, your existing monitoring, your existing connection pooler, your existing backup tooling.


pg_cron: Background Job Scheduling Inside Postgres

pg_cron runs SQL or stored procedures on a cron schedule, directly inside the database process. No external scheduler, no separate worker service, no application-layer job table.

Installation and Setup

sql
-- Add to shared_preload_libraries (requires restart) -- postgresql.conf: -- shared_preload_libraries = 'pg_cron' -- cron.database_name = 'mydb' CREATE EXTENSION pg_cron;

Scheduling Jobs

sql
-- Run ANALYZE on the transactions table every night at 2 AM SELECT cron.schedule( 'nightly-analyze-transactions', '0 2 * * *', 'ANALYZE transactions' ); -- Run partition maintenance every hour SELECT cron.schedule( 'partman-hourly', '0 * * * *', $$SELECT partman.run_maintenance_proc()$$ ); -- Clean up expired sessions every 5 minutes SELECT cron.schedule( 'cleanup-expired-sessions', '*/5 * * * *', $$DELETE FROM sessions WHERE expires_at < now()$$ ); -- Run a stored procedure weekly on Sunday at 3 AM SELECT cron.schedule( 'weekly-aggregation', '0 3 * * 0', $$CALL compute_weekly_stats()$$ );

Managing Jobs

sql
-- List all scheduled jobs SELECT jobid, jobname, schedule, command, active FROM cron.job ORDER BY jobid; -- View recent job run history SELECT jobid, job_pid, database, username, command, status, start_time, end_time, return_message FROM cron.job_run_details ORDER BY start_time DESC LIMIT 20; -- Disable a job without removing it UPDATE cron.job SET active = false WHERE jobname = 'nightly-analyze-transactions'; -- Remove a job SELECT cron.unschedule('nightly-analyze-transactions');

When pg_cron Is the Right Tool

pg_cron is ideal for:

  • Database maintenance tasks (ANALYZE, VACUUM, partition creation)
  • Data retention cleanup (delete rows older than N days)
  • Periodic aggregation into summary tables
  • Refreshing materialized views on a schedule

It is not suitable for:

  • Long-running jobs that need external resources (API calls, file I/O)
  • Jobs that need distributed coordination across multiple Postgres instances
  • Jobs with complex dependency graphs (use Airflow, Prefect, or similar)

TimescaleDB: Time-Series at Postgres Scale

TimescaleDB is an open-source extension that adds automatic time-based partitioning (hypertables), columnar compression, and time-series-specific query functions on top of Postgres.

Hypertables

A hypertable is a Postgres table with automatic time-based partitioning managed by TimescaleDB. Data is partitioned into "chunks" automatically — no manual CREATE TABLE partition ... required.

sql
CREATE EXTENSION timescaledb; -- Convert a regular table to a hypertable CREATE TABLE metrics ( time TIMESTAMPTZ NOT NULL, device_id BIGINT NOT NULL, temperature DOUBLE PRECISION, humidity DOUBLE PRECISION ); -- Convert to hypertable, partitioned by 'time' with 7-day chunks SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '7 days'); -- Inserts now automatically route to the correct chunk INSERT INTO metrics VALUES (now(), 1001, 23.5, 68.2);

The hypertable looks and behaves like a regular Postgres table. All standard SQL works. TimescaleDB handles partitioning transparently.

Columnar Compression

TimescaleDB's native columnar compression is the feature that makes it genuinely competitive with InfluxDB and ClickHouse for time-series workloads.

sql
-- Enable compression on chunks older than 7 days ALTER TABLE metrics SET ( timescaledb.compress, timescaledb.compress_segmentby = 'device_id', -- group by device for compression timescaledb.compress_orderby = 'time DESC' ); -- Automate compression with a policy SELECT add_compression_policy('metrics', INTERVAL '7 days'); -- Check compression ratios SELECT chunk_name, pg_size_pretty(before_compression_total_bytes) AS before, pg_size_pretty(after_compression_total_bytes) AS after, ROUND(100.0 * (1 - after_compression_total_bytes::numeric / NULLIF(before_compression_total_bytes, 0)), 1) AS compression_pct FROM chunk_compression_stats('metrics') ORDER BY chunk_name;

Typical compression ratios for time-series data: 10–40x. A 100GB raw metrics table becomes 3–10GB compressed. Compressed chunks are read using a columnar scan — only the requested columns are decompressed, dramatically reducing I/O for aggregation queries.

Continuous Aggregates

Continuous aggregates are materialized views that update automatically as new data arrives:

sql
-- Create a continuous aggregate: hourly averages per device CREATE MATERIALIZED VIEW metrics_hourly WITH (timescaledb.continuous) AS SELECT time_bucket('1 hour', time) AS bucket, device_id, AVG(temperature) AS avg_temp, AVG(humidity) AS avg_humidity, MAX(temperature) AS max_temp, MIN(temperature) AS min_temp FROM metrics GROUP BY bucket, device_id; -- Refresh policy: update the aggregate every 30 minutes for recent data SELECT add_continuous_aggregate_policy( 'metrics_hourly', start_offset => INTERVAL '3 hours', end_offset => INTERVAL '1 hour', schedule_interval => INTERVAL '30 minutes' ); -- Query the aggregate (much faster than raw table for dashboards) SELECT bucket, device_id, avg_temp FROM metrics_hourly WHERE bucket > now() - INTERVAL '24 hours' ORDER BY bucket DESC;

Data Retention

sql
-- Automatically drop chunks older than 90 days SELECT add_retention_policy('metrics', INTERVAL '90 days');

TimescaleDB vs Plain Postgres Partitioning

Plain Postgres PartitioningTimescaleDB
Partition creationManual or pg_partmanAutomatic
Columnar compressionNoYes (10–40x)
Continuous aggregatesManual with triggersNative
Retention policiesManual pg_cron jobNative
SQL compatibilityFullFull
Extension requiredNoYes

TimescaleDB is the right choice when: you have high-volume time-series data (metrics, events, IoT) that needs compression and fast aggregate queries, and you want to stay on Postgres rather than migrating to InfluxDB or ClickHouse.


Citus: Horizontal Sharding for Postgres

Citus distributes a Postgres database across multiple nodes — sharding both data and query execution. It is built by Microsoft and available as an open-source extension or managed service (Azure Cosmos DB for PostgreSQL).

The Citus Architecture

Coordinator node:
  - Accepts all SQL queries
  - Parses and plans queries
  - Routes data operations to the correct worker(s)
  - Aggregates results from workers

Worker nodes (N nodes):
  - Store shards of distributed tables
  - Execute fragments of queries pushed down from coordinator
  - Each shard is a regular Postgres table on the worker

Distribution Key Selection

sql
-- On the coordinator: distribute transactions by sender address SELECT create_distributed_table('transactions', 'sender'); -- Now: -- Rows with the same sender are colocated on the same shard -- Queries filtering by sender go to exactly one worker -- Aggregations by sender are pushed down to workers, results aggregated on coordinator

The distribution key is the most critical decision. Choose it so:

  1. Frequently joined tables are colocated — if transactions and blocks are joined frequently, distribute both by a compatible key
  2. Hot queries filter on itWHERE sender = '0xabc...' goes to one shard, not all shards
  3. Data is evenly distributed — high-cardinality columns like UUIDs or blockchain addresses

Reference Tables

Some tables are small and frequently joined — distribute them to all workers:

sql
-- Reference table: replicated to every worker node SELECT create_reference_table('status_codes'); -- Every worker has a full copy — joins with reference tables are local

When Citus Is the Right Answer

Citus is appropriate when:

  • Write throughput exceeds what a single Postgres instance handles (~50,000+ TPS)
  • Data volume exceeds what a single server stores efficiently (10+ TB)
  • Most queries filter on the distribution key — if queries frequently scan all shards, sharding adds overhead without benefit

For most applications below these thresholds, a single Postgres primary with read replicas is simpler and sufficient.


pgvector: Vector Embeddings for AI Workloads

pgvector adds a vector data type and vector similarity search indexes to Postgres. It is the extension that makes Postgres competitive with dedicated vector databases like Pinecone, Weaviate, and Chroma for AI/RAG (Retrieval-Augmented Generation) workloads.

What Vector Search Is

An embedding is a fixed-size array of floating-point numbers that represents the semantic meaning of text, an image, or any other content. Embeddings with similar meaning have small vector distances. Searching for the most similar embeddings is the core operation in:

  • RAG (Retrieval-Augmented Generation): find documents semantically similar to a query, feed them to an LLM as context
  • Semantic search: return results by meaning, not keyword matching
  • Recommendation systems: find items similar to what a user engaged with
  • Duplicate detection: find near-duplicate content by vector proximity

Installation and Schema

sql
CREATE EXTENSION vector; -- Store 1536-dimensional embeddings (OpenAI ada-002 output size) CREATE TABLE documents ( id BIGSERIAL PRIMARY KEY, content TEXT NOT NULL, embedding vector(1536), metadata JSONB, created_at TIMESTAMPTZ DEFAULT now() );

Inserting Embeddings

python
import openai import psycopg2 # Generate embedding from OpenAI response = openai.embeddings.create( input="PostgreSQL internals for production engineers", model="text-embedding-ada-002" ) embedding = response.data[0].embedding # list of 1536 floats # Insert into Postgres cur.execute( "INSERT INTO documents (content, embedding) VALUES (%s, %s)", ("PostgreSQL internals...", embedding) )

Vector Similarity Search: Exact vs Approximate

Exact search (no index — scans all vectors):

sql
-- Find the 5 most similar documents to a query embedding SELECT id, content, embedding <-> '[0.1, 0.2, ...]'::vector AS distance FROM documents ORDER BY embedding <-> '[0.1, 0.2, ...]'::vector LIMIT 5;

Operators:

  • <-> — L2 (Euclidean) distance
  • <#> — negative inner product (for dot product similarity)
  • <=> — cosine distance (most common for text embeddings)

Exact search is O(N) — every vector is compared. For up to ~100K vectors, this is fast enough. Above that, you need an approximate nearest-neighbor (ANN) index.

IVFFlat Index

IVFFlat (Inverted File Flat) divides vectors into lists clusters. A query searches only the nearest N clusters instead of all vectors.

sql
-- Build IVFFlat index: 100 lists for ~1M vectors -- Rule of thumb: lists ≈ sqrt(row_count) CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100); -- Query with probe count: search nearest probes clusters -- Higher probes = better recall, slower query SET ivfflat.probes = 10; -- search 10 of 100 clusters SELECT id, content, embedding <=> $1 AS distance FROM documents ORDER BY embedding <=> $1 LIMIT 5;

Recall vs speed trade-off:

  • probes = 1: fastest, lowest recall (~70% of exact results)
  • probes = 10: balanced (~90% recall)
  • probes = lists: exact search (same as no index)

Build IVFFlat only after the table has data — the index quality depends on the data distribution at build time. Rebuild if data distribution changes significantly.

HNSW (Hierarchical Navigable Small World) builds a graph-based index with better recall/speed trade-offs than IVFFlat for most workloads.

sql
-- Build HNSW index -- m: number of connections per node (higher = better recall, larger index) -- ef_construction: search width during build (higher = better index, slower build) CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64); -- Query: ef_search controls recall at query time (higher = better recall, slower) SET hnsw.ef_search = 100; SELECT id, content, embedding <=> $1 AS distance FROM documents ORDER BY embedding <=> $1 LIMIT 5;

HNSW vs IVFFlat:

IVFFlatHNSW
Build timeFastSlower (especially for large datasets)
MemoryLowerHigher (graph stored in memory)
Recall at same speedLowerHigher
Incremental insertsRequires rebuild for best qualityGood — inserts update the graph
Best for>1M vectors, memory-constrainedMost workloads, especially with inserts

Hybrid Search: Vector + SQL Filters

The real power of pgvector in Postgres: combine semantic vector search with SQL predicates in one query:

sql
-- Find the 5 most similar documents to a query embedding -- but only from the 'technical' category published in 2026 SELECT id, content, embedding <=> $1 AS distance FROM documents WHERE category = 'technical' AND created_at >= '2026-01-01' ORDER BY embedding <=> $1 LIMIT 5;

A dedicated vector database requires an additional filter step after the ANN search — often requiring you to fetch more candidates than needed and filter in the application. Postgres does this in a single query with the planner's cost model deciding the optimal execution order.

pgvector vs Dedicated Vector Databases

pgvectorPinecone / Weaviate
ACID transactionsYesNo
SQL filters in same queryNativePost-processing
Operational complexitySame as PostgresNew system to operate
Metadata storageJSONB + columnsSeparate metadata store
Scale~100M vectors (1536-dim, 128GB RAM)Unlimited (distributed)
CostPostgres hosting costPer-vector pricing

For most RAG and semantic search applications below 100M vectors: pgvector in Postgres eliminates the need for a dedicated vector database entirely. Your embeddings, metadata, and business data live in one system with consistent transactions.


Summary

ExtensionReplacesWhen to Use
pg_cronExternal cron + job tablesDB maintenance, data cleanup, periodic aggregation
TimescaleDBInfluxDB, ClickHouse (time-series)High-volume time-series, IoT metrics, automated compression
CitusManual sharding, Cassandra (write scale)>50K TPS write throughput or >10TB single-table datasets
pgvectorPinecone, Weaviate, ChromaRAG, semantic search, recommendation systems, <100M vectors

The principle: before adding a new database to your infrastructure, check if a Postgres extension solves the problem. The operational cost of one well-understood Postgres cluster is almost always lower than operating two specialized systems.

Discussion