The most architecturally significant change to Postgres in a decade — the definitive technical breakdown.

Module 8 — PostgreSQL 18: Asynchronous I/O and What It Changes

What this module covers: PostgreSQL 18 ships the most architecturally significant change to the storage layer in the project's history: native asynchronous I/O. Every module in this course has described a world where Postgres processes one I/O request at a time per process. PostgreSQL 18 changes that for reads. This module explains what async I/O means, how it is implemented, what it does and does not change for sequential scans, vacuum, checkpoints, WAL, and replication, and how to tune it for your workload.

The Problem: Synchronous I/O as a Throughput Ceiling

Every I/O operation in PostgreSQL prior to version 18 follows the same pattern:

Backend or background process needs a page from disk
It calls read() (or pread()) — a synchronous system call
The OS schedules the I/O, the process blocks waiting for it to complete
The page arrives, the process continues

This is the simplest model, and it works well when:

Data is in the OS page cache (no real I/O, cache hit is fast)
shared_buffers is large enough to hold the working set
I/O latency is not the bottleneck

The problem: on modern NVMe storage, the disk can serve multiple I/O requests simultaneously. An NVMe SSD reaches its rated bandwidth only when its I/O queue is kept full (say, 32 requests deep), yet Postgres submits exactly one request at a time per process, waits for it, then submits the next one. Postgres is artificially serializing I/O that the hardware can parallelize.

The Sequential Scan Example

A sequential scan of a 10GB table on NVMe:

Before PG18 (synchronous):

Process reads page 0, waits ~50μs
Process reads page 1, waits ~50μs
...
Total: 1,310,720 pages × 50μs = 65 seconds

Theoretical with async (queue depth 32):

Process submits 32 read requests simultaneously
Hardware handles them in parallel
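Effective throughput: roughly 32 × baseline, because each request still takes about 50μs but 32 of them are in flight at once
Total: 65 seconds ÷ 32 ≈ 2 seconds (about 5 GB/s, close to what a fast NVMe drive can deliver)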

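The real-world gain is not 32x. There is submission overhead and coordination cost, and the OS page cache and kernel readahead already hide part of the latency in the synchronous case, so the 65-second baseline is pessimistic. A 2–5x throughput improvement on cold, I/O-bound reads is a more realistic expectation, and the exact figure depends heavily on storage and workload.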
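
The arithmetic behind the two totals above can be reproduced in a few lines. This is a back-of-envelope model only: the 8 KiB page size is the PostgreSQL default, while the 50μs latency and the assumption of perfect overlap at queue depth 32 are the illustrative figures from this example, not measurements.

```python
# Back-of-envelope model of the 10 GB sequential scan example.
# Assumptions (illustrative): 8 KiB pages, ~50 microseconds per page read,
# and perfect overlap of up to `queue_depth` reads. Kernel readahead,
# CPU cost, and bandwidth limits are ignored.

PAGE_SIZE = 8 * 1024          # default PostgreSQL block size, in bytes
TABLE_SIZE = 10 * 1024**3     # 10 GiB table
READ_LATENCY_S = 50e-6        # assumed latency of a single page read


def scan_seconds(table_bytes: int, queue_depth: int) -> float:
    """Estimated scan time when `queue_depth` reads overlap perfectly."""
    pages = table_bytes // PAGE_SIZE
    return pages * READ_LATENCY_S / queue_depth


pages = TABLE_SIZE // PAGE_SIZE
print(pages)                                  # 1310720
print(round(scan_seconds(TABLE_SIZE, 1), 1))  # 65.5  (one read at a time)
print(round(scan_seconds(TABLE_SIZE, 32), 1)) # 2.0   (32 reads in flight)
```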

Why Postgres Didn't Have Async I/O Earlier

The multi-process architecture (Module 0) made async I/O difficult. In a multi-threaded database (SQL Server, MySQL), threads share one address space, so any thread can submit async I/O requests and any other thread can process the completions. In Postgres, each backend is an independent OS process. Coordinating async I/O across processes, especially for shared buffer management, required significant architectural work.

The solution that shipped in PG18 took years of design and implementation across multiple development cycles.

The PG18 Async I/O Architecture

Two Async Methods: io_uring and I/O Workers

PG18 implements async I/O through a pluggable I/O method abstraction. Two asynchronous methods ship, alongside a synchronous fallback (io_method = sync):

io_uring (Linux 5.1+):

Uses the Linux io_uring interface, a pair of ring buffers shared between the process and the kernel
Batched submission: Postgres queues several I/O requests in the submission ring and hands them to the kernel with a single system call
Completions are read from the completion ring
Lowest overhead, highest throughput on modern Linux
Requires Linux kernel ≥ 5.1 (practically: ≥ 5.10 LTS for production use) and a PostgreSQL build with liburing support (--with-liburing)

Worker processes (cross-platform, the default):

Starts a pool of dedicated I/O worker processes, sized by io_workers (default 3)
Workers perform synchronous reads on behalf of the requesting process
The requesting process queues work for the pool and continues, checking for completions later
Works on Linux, macOS, Windows, and anywhere else Postgres runs
Higher overhead than io_uring but significantly better than per-process blocking I/O

ini
# Control which method is used (takes effect at server start)
io_method = worker            # default on every platform; uses I/O worker processes
#io_method = io_uring         # Linux only, needs a build with liburing; lowest overhead
#io_method = sync             # pre-PG18 behavior: synchronous I/O

# Size of the worker pool when io_method = worker
io_workers = 3                # default

The I/O Concurrency Model

Within a single Postgres process (backend or background worker), the async I/O system allows multiple I/O requests to be in-flight simultaneously. The concurrency limit:

ini
# Max simultaneous I/O requests per Postgres process
io_max_concurrency = -1     # default: chosen automatically from shared_buffers
                            # and the number of processes, at most 64

# Total I/O requests that can be in flight across the entire server:
# approximately io_max_concurrency × the number of server processes

When a process needs a page:

It submits an async read request (non-blocking)
It continues processing: issuing further read requests and working on pages that have already arrived
It polls for completion when it actually needs the page data
If the page is ready, it proceeds immediately; if not, it waits briefly

This prefetch-ahead pattern is where the throughput gain comes from.
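
The submit, continue, and wait-only-when-needed steps above can be sketched as a sliding window of in-flight reads. The sketch below is an analogy, not PostgreSQL's implementation: it uses Python threads issuing blocking os.pread() calls (loosely similar in spirit to io_method = worker, which uses processes), and the names read_ahead, demo, and depth are hypothetical. It assumes a Unix-like system, since os.pread is not available on Windows.

```python
import os
import tempfile
from collections import deque
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 8192  # bytes per page in this sketch


def read_ahead(fd: int, n_pages: int, depth: int):
    """Yield pages in order while keeping up to `depth` reads in flight."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        in_flight = deque()
        next_page = 0
        while next_page < n_pages or in_flight:
            # Top up the window: submit reads without waiting for them.
            while next_page < n_pages and len(in_flight) < depth:
                offset = next_page * PAGE_SIZE
                in_flight.append(pool.submit(os.pread, fd, PAGE_SIZE, offset))
                next_page += 1
            # Block only on the oldest request, the page needed next.
            yield in_flight.popleft().result()


def demo() -> int:
    """Write 64 distinct pages to a temp file and read them back in order."""
    with tempfile.TemporaryFile() as f:
        for i in range(64):
            f.write(bytes([i]) * PAGE_SIZE)
        f.flush()
        pages = list(read_ahead(f.fileno(), 64, depth=8))
    # Pages come back in submission order even though reads overlapped.
    assert [p[0] for p in pages] == list(range(64))
    return len(pages)


if __name__ == "__main__":
    print(demo())
```

The consumer never waits for the newest request, only for the oldest one, which has usually had the longest time to complete. That is the same reason a deeper window helps more on high-latency storage.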

What Changes for Each Subsystem

Sequential Scans

Sequential scans see the most dramatic improvement. The executor now submits read-ahead I/O requests for upcoming pages while processing current pages.

The effective_io_concurrency parameter takes on new meaning:

ini
# Before PG18: mainly a prefetch hint for bitmap heap scans (default 1)
# After PG18: controls read-ahead depth for scans that use async I/O (default 16)
effective_io_concurrency = 16   # for SSD: 16–64
                                 # for NVMe: 64–256
                                 # for HDD: 2–4

In practice: a sequential scan with effective_io_concurrency = 64 can keep up to roughly 64 page reads in flight ahead of the current position (the look-ahead distance adapts to the workload and is also bounded by io_max_concurrency). By the time the executor reaches those pages, they are already in the buffer cache. On a cold cache, this transforms sequential scan throughput from single-queue to deep-queue performance.

sql
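-- Measure sequential scan throughput before/after
-- Caution: the first run warms shared_buffers and the OS page cache.
-- Restart Postgres and drop the OS cache between runs, or use a table
-- much larger than RAM, otherwise the second run looks faster regardless.
\timing on

SET effective_io_concurrency = 1;
SELECT count(*) FROM transactions;  -- baseline

SET effective_io_concurrency = 64;
SELECT count(*) FROM transactions;  -- async I/O benefit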

Checkpoints

Checkpoints (Module 3) write all dirty shared_buffers pages to disk. The checkpointer submits one write at a time, with checkpoint_completion_target spreading the writes over the checkpoint interval.

PG18 does not change this. The asynchronous I/O subsystem in PG18 covers reads only, so the checkpointer's writes are still issued synchronously. Asynchronous writes are a goal of ongoing development on the same infrastructure, and would change the checkpoint I/O profile roughly as follows:

Today (including PG18): flat, throttled write rate over the checkpoint window
With async writes (future): batched submission followed by hardware-parallel completion, finishing the same work in less wall-clock time

The practical implication for PG18: checkpoint tuning is unchanged. checkpoint_completion_target remains the main tool for spreading I/O pressure on mixed read/write workloads.

sql
-- Monitor checkpoint I/O in PG18
-- (since PG17 these counters live in pg_stat_checkpointer, not pg_stat_bgwriter)
SELECT
  num_timed,
  num_requested,
  write_time,        -- ms spent writing dirty pages
  sync_time,         -- ms spent waiting for fsync
  buffers_written
FROM pg_stat_checkpointer;

-- Compare write_time per buffer across upgrades to see whether checkpoint writes got faster

VACUUM

VACUUM (Module 4) performs sequential reads of the heap and sequential reads/writes of indexes. In PG18, autovacuum workers submit async reads during heap scanning, prefetching pages ahead of the current position.

The heap and index phases of VACUUM still run one after another. What changes is that reads within a phase can be issued ahead of the current position instead of blocking on one page at a time.

Practical result: autovacuum throughput on cold tables (pages not in shared_buffers) improves significantly. For tables that fit in shared_buffers, the benefit is minimal (cache hits have no I/O latency to overlap).

ini
# Vacuum's prefetch depth is governed by maintenance_io_concurrency
maintenance_io_concurrency = 16  # PG18 default; how many reads vacuum may keep in flight

WAL Writer and Background Writer

Neither the WAL writer nor the background writer uses asynchronous I/O in PG18: both still issue ordinary synchronous writes. The read-side work in PG18 lays the groundwork for deeper write queues in later releases, but nothing changes for these processes today.

The bgwriter_lru_maxpages and bgwriter_delay parameters still control the background writer's pace exactly as before.

Replication: WAL Receiver

The WAL receiver on standbys (Module 3) writes and flushes received WAL synchronously in PG18, as in earlier versions, so async I/O does not shorten the flush-lag component of replication lag. Standbys do benefit on the read side: queries running on a hot standby use the same asynchronous read paths as queries on the primary.

Measuring the Impact

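pg_stat_io: Per-Subsystem I/O Visibility

pg_stat_io, introduced in PostgreSQL 16, is a system view that breaks down I/O statistics by backend type, object, and context. It was the first per-subsystem I/O observability built into the core. PG18 extends it with byte counters (read_bytes, write_bytes, extend_bytes) and WAL I/O, and adds the pg_aios view for asynchronous I/O requests currently in progress. The timing columns are only populated when track_io_timing is on.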

sql
-- I/O statistics by backend type and operation
SELECT
  backend_type,
  object,
  context,
  reads,
  read_time,
  writes,
  write_time,
  extends,
  extend_time,
  hits,
  evictions,
  reuses
FROM pg_stat_io
ORDER BY read_time DESC;

sql
-- Where is most read time going?
SELECT
  backend_type,
  object,
  reads,
  ROUND(read_time::numeric, 2) AS read_ms,
  ROUND((read_time / NULLIF(reads, 0))::numeric, 4) AS avg_read_ms
FROM pg_stat_io
WHERE reads > 0
ORDER BY read_time DESC
LIMIT 10;

This view answers questions that previously required external monitoring:

Which process type is doing the most disk reads? (autovacuum vs backends vs checkpointer)
What is the average I/O latency per read? (useful for detecting storage degradation)
How many pages are being evicted from shared_buffers vs reused?

pg_stat_io is also available in PG16+, not just PG18 — but async I/O makes the data significantly more actionable.

Benchmarking Your Workload

sql
-- Reset I/O stats
SELECT pg_stat_reset_shared('io');

-- Run your benchmark workload
-- ...

-- Check async I/O effectiveness
SELECT
  backend_type,
  reads,
  read_time,
  ROUND((reads / NULLIF(read_time, 0) * 1000)::numeric, 0) AS reads_per_second
FROM pg_stat_io
WHERE reads > 1000
ORDER BY reads DESC;

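Compare reads/second against your storage's rated IOPS to see how close to hardware limits Postgres is getting. Two caveats: read_time is only populated when track_io_timing is on, and it is accumulated I/O time rather than wall-clock time, so pair this figure with a wall-clock measurement of the benchmark run.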
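
Because the pg_stat_io counters are cumulative, an alternative to resetting them is to take a snapshot before and after the benchmark and compare the two. The helper below is a minimal sketch of that subtraction; io_delta is a hypothetical name, the dictionaries stand in for rows fetched from pg_stat_io by whatever client you use, and the sample numbers are made up for illustration.

```python
from typing import Optional


def io_delta(before: dict, after: dict) -> dict:
    """Difference between two snapshots of one pg_stat_io row.

    Both arguments are expected to carry `reads` (a count) and
    `read_time` (milliseconds, populated only with track_io_timing on).
    """
    reads = after["reads"] - before["reads"]
    read_ms = after["read_time"] - before["read_time"]
    avg: Optional[float] = read_ms / reads if reads else None
    return {"reads": reads, "read_ms": read_ms, "avg_read_ms": avg}


# Made-up sample values for a 'client backend' row.
snapshot_before = {"reads": 1000, "read_time": 250.0}
snapshot_after = {"reads": 5000, "read_time": 650.0}

delta = io_delta(snapshot_before, snapshot_after)
print(delta["reads"], delta["read_ms"], delta["avg_read_ms"])
```

Snapshots avoid clearing statistics that other monitoring tools may depend on.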

Configuration for PG18 Async I/O

The Key Parameters

ini
# I/O method: set at server start
io_method = worker              # default; works on every platform
#io_method = io_uring           # Linux + modern kernel + liburing build

# Worker pool size (only used when io_method = worker)
io_workers = 3                  # default

# Per-process I/O concurrency
io_max_concurrency = -1         # default: auto-sized, at most 64
                                # set explicitly (16–64) for high-latency storage

# Prefetch depth for sequential scans and bitmap heap scans
effective_io_concurrency = 32   # SSD: 16–64; NVMe: 64–256

# Maintenance operations (vacuum, index builds)
maintenance_io_concurrency = 16  # can be lower than effective_io_concurrency
                                  # to reduce autovacuum I/O pressure

Tuning for Different Storage Profiles

NVMe SSD (local, PCIe 4.0):

ini
io_method = io_uring
io_max_concurrency = 32
effective_io_concurrency = 200
maintenance_io_concurrency = 64
random_page_cost = 1.1

Network-attached SSD (EBS gp3, cloud storage):

ini
io_method = io_uring            # or worker if io_uring unavailable
io_max_concurrency = 64         # higher concurrency for higher latency
effective_io_concurrency = 128
maintenance_io_concurrency = 32
random_page_cost = 1.5

HDD (spinning disk):

ini
io_method = sync                # async I/O is less beneficial for HDD
                                # seeks dominate, not queue depth
effective_io_concurrency = 2
random_page_cost = 4.0          # default is still correct for HDD
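
The reasoning behind "higher concurrency for higher latency" in the profiles above follows from Little's law: the number of requests that must be in flight equals the target request rate multiplied by the time each request takes. The sketch below applies it to three hypothetical storage profiles. The IOPS and latency figures are placeholders chosen for illustration, not vendor specifications, and the result is the depth needed across the whole device, which all active Postgres processes share.

```python
def required_queue_depth(target_iops: int, latency_us: int) -> int:
    """Little's law: requests in flight = request rate x time per request.

    Integer arithmetic in microseconds avoids floating-point rounding;
    the result is rounded up and is never less than 1.
    """
    in_flight = (target_iops * latency_us + 999_999) // 1_000_000
    return max(1, in_flight)


# Hypothetical profiles: (target IOPS, per-request latency in microseconds)
profiles = {
    "local NVMe": (500_000, 100),
    "network SSD": (16_000, 1_000),
    "HDD": (200, 8_000),
}

for name, (iops, latency_us) in profiles.items():
    print(f"{name}: queue depth {required_queue_depth(iops, latency_us)}")
```

With these placeholder numbers the HDD needs only about 2 requests in flight, which is consistent with the low effective_io_concurrency suggested for spinning disks.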

shared_buffers Still Matters

Async I/O improves throughput for data not in shared_buffers. For data that is in shared_buffers, async I/O provides no benefit — a cache hit has no I/O latency to overlap.

The interaction: with async I/O, you may find that a smaller shared_buffers becomes acceptable because cold reads are now faster. But this is workload-dependent. The general guidance remains: size shared_buffers to hold your active working set. Async I/O is an improvement to cold-cache performance, not a replacement for having a warm cache.

What Does Not Change

Understanding what async I/O does not change is as important as understanding what it does.

MVCC and Dead Tuple Accumulation

Async I/O does not change the MVCC model. Dead tuples still accumulate. Autovacuum still needs to be tuned. XID wraparound is still a risk. The storage mechanics from Module 2 are unchanged — async I/O just makes the I/O operations within those mechanics faster.

WAL Write Path

Commits still require WAL to be flushed to disk before returning to the client (synchronous_commit = on). WAL writes and flushes do not go through the async I/O subsystem in PG18, so both commit latency and the synchronous-commit semantics are unchanged. The durability guarantee is preserved.

Lock Acquisition and MVCC Snapshots

No changes to locking or the snapshot mechanism. Async I/O is purely a storage-layer optimization. Everything above the buffer manager — the executor, the planner, MVCC visibility checks, lock acquisition — is unchanged.

Connection Overhead

The multi-process architecture is unchanged. Each connection is still a separate OS process. Connection pooling (PgBouncer) is still mandatory above a few hundred connections. Async I/O does not make connection overhead cheaper.

Full Page Writes

Full page writes still happen on the first modification of each page after a checkpoint. The WAL volume generated by FPWs is unchanged, and so is the way that WAL is written.

io_uring: The Linux Kernel Interface

For engineers who want to understand the mechanism:

io_uring (introduced in Linux 5.1, stabilized in 5.10) provides two ring buffers shared between user space and kernel space:

Submission Queue (SQ): user space writes I/O requests (read page X, write page Y) into this ring
Completion Queue (CQ): kernel writes completion results into this ring when I/O finishes

The critical property: many requests can be submitted with a single system call, and completions can be read from the shared completion ring without one. (A kernel-side polling mode, SQPOLL, can remove the submission syscall entirely, but it is an opt-in io_uring feature.) Batching amortizes the per-request syscall overhead that limited async I/O approaches before io_uring.

Postgres process:
  sqe = io_uring_get_sqe(&ring)   // get submission entry
  io_uring_prep_read(sqe, fd, buf, len, offset)  // fill in read request
  io_uring_submit(&ring)           // submit batch (syscall only for batch)

  // ... do other work ...

  io_uring_wait_cqe(&ring, &cqe)  // wait for completion (or poll)
  result = cqe->res               // bytes read or error
  io_uring_cqe_seen(&ring, cqe)   // mark completion consumed

PG18's io_uring backend wraps this interface, allowing backends and background workers to submit batches of page reads/writes without blocking.
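
To make the two-ring idea concrete, here is a toy model of a submission queue and a completion queue. It is a single-process simulation with hypothetical names (Ring, fake_kernel); real io_uring rings live in memory shared with the kernel and rely on memory barriers, none of which is modeled here. What it does show is the indexing scheme (free-running head and tail counters masked by a power-of-two size) and the flow of a batch of requests from one ring to the other.

```python
class Ring:
    """Fixed-size FIFO ring with a power-of-two capacity."""

    def __init__(self, size: int):
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0  # next entry the consumer will read
        self.tail = 0  # next slot the producer will fill

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def fake_kernel(sq: Ring, cq: Ring) -> int:
    """Stand-in for the kernel: drain submissions and post completions."""
    done = 0
    while (req := sq.pop()) is not None:
        cq.push({"user_data": req["user_data"], "res": req["len"]})
        done += 1
    return done


sq, cq = Ring(8), Ring(8)
for page in range(3):                      # queue three page reads
    sq.push({"user_data": page, "len": 8192})
fake_kernel(sq, cq)                        # one "submit" covers the whole batch

completions = []
while (c := cq.pop()) is not None:         # reap completions without a syscall
    completions.append(c)
print([c["user_data"] for c in completions])  # [0, 1, 2]
```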

The Kernel Version Requirement

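io_uring has had security issues in older kernels, and some container runtimes and hardened environments restrict it. A reasonable rule of thumb (not an official PostgreSQL requirement):

Minimum: Linux 5.1 (io_uring first available)
Recommended: Linux 5.19+ (significant io_uring stability improvements)
Production: Linux 6.x LTS (most mature io_uring implementation)

io_method = io_uring is never selected automatically. The default is worker on every platform, and io_uring must be requested explicitly on a server built with liburing support. On kernels below 5.1, worker and sync are the only options.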

sql
-- Check which io_method is active
SHOW io_method;

-- version() reports the PostgreSQL version and build platform, not the running kernel.
-- Check the kernel on the host itself, for example with: uname -r
SELECT version();
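
A kernel check can also be scripted on the database host. The sketch below parses a kernel release string and compares it with a minimum version; kernel_at_least is a hypothetical helper, and the 5.1 threshold is simply the first kernel release that included io_uring. It says nothing about whether io_uring is permitted by the host's security policy or whether PostgreSQL was built with liburing.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a release string like '6.8.0-45-generic' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False  # unrecognized format: treat as not supported
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    if platform.system() == "Linux":
        release = platform.release()
        print(release, "io_uring-capable kernel:", kernel_at_least(release, 5, 1))
    else:
        print("Not Linux: io_uring is unavailable, use io_method = worker")
```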

Interaction with the OS Page Cache

An important nuance: async I/O in PG18 works through the OS page cache (buffered I/O), not direct I/O (O_DIRECT). This means:

Postgres submits an async read for a page
The OS checks its page cache — cache hit: returns immediately
Cache miss: OS schedules actual disk I/O, delivers page when complete

The performance benefit of async I/O is therefore most visible when the OS page cache is cold (data not in OS cache). When the OS page cache is warm (recently accessed data), async I/O adds minimal benefit because cache hits return immediately regardless.

For dedicated Postgres servers where the OS page cache effectively extends shared_buffers, this means:

Frequently-accessed data: minimal async I/O benefit (already in OS cache)
Infrequently-accessed data / full table scans: significant benefit (cache misses overlap with processing)

Future versions of Postgres may support O_DIRECT in production to bypass the OS cache entirely, giving Postgres direct control over I/O scheduling. Today direct I/O exists only as the experimental developer setting debug_io_direct, and doing it well requires additional complexity in the buffer management layer.

Summary

Aspect	Before PG18	After PG18
Sequential scan I/O	One page at a time, blocking	Prefetch-ahead, non-blocking, hardware-parallel
Checkpoint writes	Sequential, throttled	Unchanged (async writes not yet implemented)
Autovacuum heap scan	Blocking per-page reads	Prefetch-ahead reads
WAL writer	Synchronous writes	Unchanged
I/O visibility	`pg_stat_io` (PG16+), `pg_stat_bgwriter`	`pg_stat_io` with byte counters and WAL I/O, plus `pg_aios`
MVCC semantics	Unchanged	Unchanged
Durability guarantees	Unchanged	Unchanged
Connection model	Multi-process	Unchanged
Config key	`effective_io_concurrency`	`io_method`, `io_workers`, `io_max_concurrency`, `effective_io_concurrency`

The single most important takeaway: async I/O is a throughput improvement for I/O-bound workloads. If your workload is CPU-bound (query planning, hash join computation, sort operations) or cache-bound (working set fits in shared_buffers), the impact will be minimal. Profile with pg_stat_io to understand where your time actually goes before expecting dramatic gains.

Module 9 covers the other major PG18 addition: OLD and NEW aliases in the RETURNING clause — a small syntactic change that eliminates entire classes of application-level race conditions.
