The most architecturally significant change to Postgres in a decade — the definitive technical breakdown.
Module 8 — PostgreSQL 18: Asynchronous I/O and What It Changes
What this module covers: PostgreSQL 18 ships one of the most architecturally significant changes to the storage layer in the project's history: native asynchronous I/O. Every module in this course has described a world where Postgres processes one I/O request at a time per process. PostgreSQL 18 changes that for reads. This module explains what async I/O means, how it is implemented, what it changes (and does not yet change) for sequential scans, checkpoints, vacuum, and replication, and how to tune it for your workload.
The Problem: Synchronous I/O as a Throughput Ceiling
Every I/O operation in PostgreSQL prior to version 18 follows the same pattern:
- A backend or background process needs a page from disk
- It calls read() (or pread()), a synchronous system call
- The OS schedules the I/O, and the process blocks until it completes
- The page arrives, and the process continues
This is the simplest model, and it works well when:
- Data is in the OS page cache (no real I/O, cache hit is fast)
- shared_buffers is large enough to hold the working set
- I/O latency is not the bottleneck
The problem: on modern NVMe storage, the disk can serve multiple I/O requests simultaneously. An NVMe SSD needs many requests in flight (a queue depth of 32 or more) to reach its rated bandwidth, yet Postgres submits exactly one request at a time per process, waits for it, then submits the next one. Postgres is artificially serializing I/O that the hardware can parallelize.
The Sequential Scan Example
A sequential scan of a 10GB table on NVMe:
Before PG18 (synchronous):
- Process reads page 0, waits ~50μs
- Process reads page 1, waits ~50μs
- ...
- Total: 1,310,720 pages × 50μs = 65 seconds
Theoretical with async (queue depth 32):
- Process submits 32 read requests simultaneously
- Hardware handles them in parallel
- Effective throughput: up to 32 × the single-request baseline, because per-request latency stays roughly the same
- Total: 65 seconds ÷ 32 ≈ 2 seconds in the ideal case
The real-world gain is not 32x: there is overhead and coordination cost, the OS page cache intervenes, and kernel readahead already hides part of the latency for sequential access before PG18. Even so, throughput improvements in the 2–5x range on I/O-bound workloads have been reported in benchmarks.
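The arithmetic in this example can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a benchmark: the function name scan_seconds is illustrative, the 8 kB page size is the Postgres default block size, and the model assumes every read costs the same latency and that queue depth divides total time perfectly.

```python
PAGE_SIZE = 8192  # default Postgres block size in bytes


def scan_seconds(table_bytes: int, latency_us: int, queue_depth: int = 1) -> float:
    """Idealized time to read a table when `queue_depth` reads overlap perfectly."""
    pages = table_bytes // PAGE_SIZE
    return pages * latency_us / queue_depth / 1_000_000


table = 10 * 1024**3  # the 10GB table from the example

print(table // PAGE_SIZE)           # 1310720 pages
print(scan_seconds(table, 50))      # 65.536 (one read at a time)
print(scan_seconds(table, 50, 32))  # 2.048 (32 reads in flight, ideal case)
```

The ideal figure is a ceiling. Real scans land well short of it for the reasons given above.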
Why Postgres Didn't Have Async I/O Earlier
The multi-process architecture (Module 0) made async I/O difficult. In a multi-threaded database (Oracle, SQL Server), threads can submit async I/O requests and the thread pool can process completions cooperatively. In Postgres, each backend is an independent OS process. Coordinating async I/O across processes — especially for shared buffer management — required significant architectural work.
The solution that shipped in PG18 took years of design and implementation across multiple development cycles.
The PG18 Async I/O Architecture
Two Backends: io_uring and Worker Processes
PG18 implements async I/O through a pluggable backend abstraction. Two asynchronous backends ship, alongside a synchronous fallback (io_method = sync):
io_uring (Linux 5.1+):
- Uses the Linux io_uring interface, a kernel-level async I/O facility built on ring buffers shared with user space
- Batched submission: Postgres writes I/O requests into the shared submission ring and hands a whole batch to the kernel with a single system call
- Completions are read from the completion ring
- Lowest overhead, highest throughput on modern Linux
- Requires Linux kernel ≥ 5.1 (practically: ≥ 5.10 LTS for production use) and a Postgres build with liburing support
Worker processes (the default, cross-platform):
- Starts a pool of dedicated I/O worker processes, sized by io_workers (Postgres remains multi-process; these are not threads)
- Workers perform synchronous read calls such as pread() on behalf of the requesting process
- The requesting process hands work to the pool and continues, checking for completions later
- Works on Linux, macOS, Windows: anywhere Postgres runs
- Higher overhead than io_uring but significantly better than per-process blocking I/O
```ini
# Control which backend is used (takes effect only at server start)
io_method = worker        # Cross-platform; the PG18 default on every platform
# io_method = io_uring    # Linux only, highest performance; needs a build with liburing
# io_method = sync        # Pre-PG18 behavior, synchronous I/O
```
The I/O Concurrency Model
Within a single Postgres process (backend or background worker), the async I/O system allows multiple I/O requests to be in-flight simultaneously. The concurrency limit:
```ini
# Max simultaneous I/O requests per Postgres process
io_max_concurrency = 16   # example value; the default of -1 lets the server
                          # choose based on shared_buffers and process limits

# Total I/O requests that can be in flight across the entire server
# (approximately: io_max_concurrency × number of processes)
```
When a process needs a page:
- It submits an async read request (non-blocking)
- It continues working: issuing further read-ahead requests and processing pages that have already arrived
- It polls for completion when it actually needs the page data
- If the page is ready, it proceeds immediately; if not, it waits briefly
This prefetch-ahead pattern is where the throughput gain comes from.
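The effect of keeping several reads in flight can be illustrated with a small deterministic model. This is a sketch under stated assumptions, not a description of the Postgres implementation: scan_time_us and its parameters are hypothetical names, the device is assumed to serve any number of reads in parallel at a fixed latency, and a new read is issued only when the page `depth` positions earlier has been fully processed.

```python
def scan_time_us(n_pages: int, read_latency_us: int,
                 cpu_us_per_page: int, depth: int) -> int:
    """Total time to read and process n_pages in order with `depth` reads in flight."""
    finish = [0] * n_pages  # time at which each page has been fully processed
    prev_finish = 0
    for i in range(n_pages):
        # The first `depth` reads are issued at time 0; later reads are issued
        # when the page `depth` positions earlier has been processed.
        issued = 0 if i < depth else finish[i - depth]
        ready = issued + read_latency_us
        start = max(ready, prev_finish)  # wait for the page and for the CPU
        prev_finish = start + cpu_us_per_page
        finish[i] = prev_finish
    return prev_finish


print(scan_time_us(1000, 50, 5, depth=1))   # 55000: every read is waited for in full
print(scan_time_us(1000, 50, 5, depth=32))  # 5050: only the first read is waited for
```

With depth 1 the model degenerates to the blocking pattern: each page costs latency plus CPU time. With enough depth the read latency is paid once, and the rest of the scan runs at the speed of the CPU work.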
What Changes for Each Subsystem
Sequential Scans
Sequential scans see the most dramatic improvement. The executor now submits read-ahead I/O requests for upcoming pages while processing current pages.
The effective_io_concurrency parameter takes on new meaning:
```ini
# Before PG18: hint for bitmap heap scan prefetching only
# After PG18: controls prefetch depth for all sequential I/O operations
effective_io_concurrency = 16   # for SSD: 16–64
                                # for NVMe: 64–256
                                # for HDD: 2–4
```
In practice: a sequential scan with effective_io_concurrency = 64 submits 64 page reads ahead of the current position. By the time the executor reaches those pages, they are already in the buffer cache. On cold storage, this transforms sequential scan throughput from single-queue to deep-queue performance.
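```sql
-- Measure sequential scan throughput before/after
-- Note: the second run also benefits from pages cached by the first;
-- for a fair comparison, use a table larger than RAM or start each run cold.
\timing on
SET effective_io_concurrency = 1;
SELECT count(*) FROM transactions;  -- baseline
SET effective_io_concurrency = 64;
SELECT count(*) FROM transactions;  -- async I/O benefit
```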
Checkpoints
Checkpoints (Module 3) write all dirty shared_buffers pages to disk. The checkpointer issues one write at a time, with checkpoint_completion_target spreading the writes over time.
The async I/O subsystem in PG18 covers read paths (sequential scans, bitmap heap scans, vacuum). Checkpoint writes still go through the synchronous path, so the checkpoint I/O profile itself does not change in this release. Batched async writes are a natural follow-up for the checkpointer, and would shift the profile like this:
PG18 and earlier: flat, throttled write rate over the checkpoint window
With async writes (future work): burst submission followed by hardware-parallel completion, able to finish the same work in less wall-clock time
The practical implication: checkpoint_completion_target remains the tool for spreading I/O pressure on mixed read/write workloads, and the checkpoint tuning advice from Module 3 carries over to PG18 unchanged.
```sql
-- Monitor checkpoint I/O in PG18
-- (since PG17 these counters live in pg_stat_checkpointer, not pg_stat_bgwriter)
SELECT
  num_timed,        -- scheduled checkpoints
  num_requested,    -- requested checkpoints
  write_time,       -- time spent writing dirty pages, in ms
  sync_time,        -- time spent waiting for fsync, in ms
  buffers_written
FROM pg_stat_checkpointer;
-- Compare write_time for a similar buffers_written count across configurations
```
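The pacing that checkpoint_completion_target provides can be approximated with simple arithmetic. The sketch below is a simplification: checkpoint_write_rate is a hypothetical helper, the inputs are made-up example values, and real pacing also reacts to WAL volume, which this model ignores.

```python
BLOCK_SIZE = 8192  # default Postgres block size in bytes


def checkpoint_write_rate(dirty_buffers: int, checkpoint_timeout_s: float,
                          completion_target: float) -> float:
    """Average bytes/second needed to finish the checkpoint inside its target window."""
    window_s = checkpoint_timeout_s * completion_target
    return dirty_buffers * BLOCK_SIZE / window_s


# Example: 100,000 dirty buffers, checkpoint_timeout = 5min, completion target 0.9
rate = checkpoint_write_rate(100_000, 300, 0.9)
print(f"{rate / 1_000_000:.2f} MB/s")  # 3.03 MB/s
```

A lower completion target shrinks the window and raises the required write rate, which is why the parameter matters on storage shared with latency-sensitive reads.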
VACUUM
VACUUM (Module 4) performs sequential reads of the heap and sequential reads/writes of indexes. In PG18, autovacuum workers submit async reads during heap scanning, prefetching pages ahead of the current position.
The index cleanup phase, which previously waited on each index page fetch in turn, can also issue read-ahead for upcoming index pages where the index type's vacuum code uses the new read path.
Practical result: autovacuum throughput on cold tables (pages not in shared_buffers) improves significantly. For tables that fit in shared_buffers, the benefit is minimal (cache hits have no I/O latency to overlap).
```ini
# Vacuum's prefetch depth follows the maintenance setting
maintenance_io_concurrency = 16   # PG18 default; how far ahead to prefetch
```
WAL Writer and Background Writer
Neither the WAL writer nor the background writer issues async writes in PG18: the first release of the subsystem covers read paths only. Both are candidates for later releases, where WAL segment writes could overlap with the next batch of WAL record processing and dirty page eviction from shared_buffers could run deeper I/O queues.
The bgwriter_lru_maxpages and bgwriter_delay parameters still control the pace, and the underlying writes remain blocking within each background writer cycle.
Replication: WAL Receiver
The WAL receiver on standbys (Module 3) still writes received WAL to disk with synchronous I/O in PG18. For high-throughput replication, async writes would reduce the flush-lag component of replication lag, because the standby could begin flushing one WAL segment while still receiving the next, but that is not part of this release.
Measuring the Impact
pg_stat_io: The New I/O Visibility View
pg_stat_io, introduced in PG16, is a system view that breaks down I/O statistics by process type, operation, and object type. It was the first time Postgres offered per-subsystem I/O observability built into the core, and it is the natural place to measure what async I/O changes.
```sql
-- I/O statistics by backend type and operation
SELECT
  backend_type, object, context,
  reads, read_time,
  writes, write_time,
  extends, extend_time,
  hits, evictions, reuses
FROM pg_stat_io
ORDER BY read_time DESC NULLS LAST;  -- NULLs would otherwise sort first
```
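```sql
-- Where is most read time going?
-- (read_time is only populated when track_io_timing = on)
SELECT
  backend_type, object, reads,
  ROUND(read_time::numeric, 2) AS read_ms,
  -- round(double precision, integer) does not exist, so cast to numeric first
  ROUND((read_time / NULLIF(reads, 0))::numeric, 4) AS avg_read_ms
FROM pg_stat_io
WHERE reads > 0
ORDER BY read_time DESC
LIMIT 10;
```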
This view answers questions that previously required external monitoring:
- Which process type is doing the most disk reads? (autovacuum vs backends vs checkpointer)
- What is the average I/O latency per read? (useful for detecting storage degradation)
- How many pages are being evicted from shared_buffers vs reused?
pg_stat_io is also available in PG16+, not just PG18 — but async I/O makes the data significantly more actionable.
Benchmarking Your Workload
```sql
-- Reset I/O stats
SELECT pg_stat_reset_shared('io');

-- Run your benchmark workload
-- ...

-- Check async I/O effectiveness (requires track_io_timing = on)
SELECT
  backend_type, reads, read_time,
  -- read_time is in milliseconds; cast to numeric so ROUND(x, 0) resolves
  ROUND((reads / NULLIF(read_time, 0) * 1000)::numeric, 0) AS reads_per_second
FROM pg_stat_io
WHERE reads > 1000
ORDER BY reads DESC;
```
Compare reads/second against your storage's rated IOPS to see how close to hardware limits Postgres is getting.
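The comparison against rated IOPS is a one-line calculation once the counters are exported. The sketch below uses hypothetical helper names and made-up input numbers. It assumes reads and read_time come from one pg_stat_io row, with read_time in milliseconds. Because read_time is summed across processes, the result is reads per second of accumulated I/O wait, which only approximates wall-clock throughput.

```python
def reads_per_second(reads: int, read_time_ms: float) -> float:
    """Reads completed per second of accumulated read time."""
    if read_time_ms <= 0:
        return 0.0  # no timing data, e.g. track_io_timing is off
    return reads / (read_time_ms / 1000.0)


def fraction_of_rated(reads: int, read_time_ms: float, rated_iops: int) -> float:
    """How close the observed rate comes to the device's rated IOPS (1.0 = at the limit)."""
    return reads_per_second(reads, read_time_ms) / rated_iops


# Made-up example: 200,000 reads in 4,000 ms against a device rated at 400,000 IOPS
print(reads_per_second(200_000, 4_000))            # 50000.0
print(fraction_of_rated(200_000, 4_000, 400_000))  # 0.125
```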
Configuration for PG18 Async I/O
The Key Parameters
```ini
# I/O method: set at server start
io_method = io_uring              # Linux + modern kernel, build with liburing
# io_method = worker              # the default; macOS, Windows, older Linux

# Per-process I/O concurrency
io_max_concurrency = 16           # default -1 auto-sizes; raise for high-latency storage (16–64)

# Prefetch depth for sequential scans
effective_io_concurrency = 32     # SSD: 16–64; NVMe: 64–256

# Maintenance operations (vacuum, index builds)
maintenance_io_concurrency = 16   # can be lower than effective_io_concurrency
                                  # to reduce autovacuum I/O pressure
```
Tuning for Different Storage Profiles
NVMe SSD (local, PCIe 4.0):
```ini
io_method = io_uring
io_max_concurrency = 32
effective_io_concurrency = 200
maintenance_io_concurrency = 64
random_page_cost = 1.1
```
Network-attached SSD (EBS gp3, cloud storage):
```ini
io_method = io_uring              # or worker if io_uring unavailable
io_max_concurrency = 64           # higher concurrency for higher latency
effective_io_concurrency = 128
maintenance_io_concurrency = 32
random_page_cost = 1.5
```
HDD (spinning disk):
```ini
io_method = sync                  # async I/O is less beneficial for HDD:
                                  # seeks dominate, not queue depth
effective_io_concurrency = 2
random_page_cost = 4.0            # default is still correct for HDD
```
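The reason higher-latency storage wants higher concurrency follows from Little's Law: requests in flight = throughput × latency. The sketch below is a rough sizing aid, not a tuning formula from the Postgres documentation. required_queue_depth is a hypothetical helper, and the IOPS and latency figures are illustrative rather than measured.

```python
def required_queue_depth(target_iops: int, latency_us: int) -> int:
    """Requests that must be in flight to sustain target_iops at the given latency."""
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    return -(-target_iops * latency_us // 1_000_000)


print(required_queue_depth(500_000, 100))  # 50: local NVMe, 100 µs per read
print(required_queue_depth(16_000, 1_000)) # 16: network volume, 1 ms per read
print(required_queue_depth(200, 8_000))    # 2: spinning disk, 8 ms per seek
```

The result is a server-wide figure. It is shared among all processes doing I/O at the same time, so a single scan rarely needs the whole amount.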
shared_buffers Still Matters
Async I/O improves throughput for data not in shared_buffers. For data that is in shared_buffers, async I/O provides no benefit — a cache hit has no I/O latency to overlap.
The interaction: with async I/O, you may find that a smaller shared_buffers becomes acceptable because cold reads are now faster. But this is workload-dependent. The general guidance remains: size shared_buffers to hold your active working set. Async I/O is an improvement to cold-cache performance, not a replacement for having a warm cache.
What Does Not Change
Understanding what async I/O does not change is as important as understanding what it does.
MVCC and Dead Tuple Accumulation
Async I/O does not change the MVCC model. Dead tuples still accumulate. Autovacuum still needs to be tuned. XID wraparound is still a risk. The storage mechanics from Module 2 are unchanged — async I/O just makes the I/O operations within those mechanics faster.
WAL Write Path
Commits still require WAL to be flushed to disk before returning to the client (synchronous_commit = on). PG18's async I/O does not touch this path: WAL writes and flushes remain synchronous, and the synchronous-commit semantics are unchanged. The durability guarantee is preserved.
Lock Acquisition and MVCC Snapshots
No changes to locking or the snapshot mechanism. Async I/O is purely a storage-layer optimization. Everything above the buffer manager — the executor, the planner, MVCC visibility checks, lock acquisition — is unchanged.
Connection Overhead
The multi-process architecture is unchanged. Each connection is still a separate OS process. Connection pooling (PgBouncer) is still mandatory above a few hundred connections. Async I/O does not make connection overhead cheaper.
Full Page Writes
Full page writes still happen on the first modification of each page after a checkpoint. The WAL volume generated by FPWs is unchanged, and so is the way that WAL is written.
io_uring: The Linux Kernel Interface
For engineers who want to understand the mechanism:
io_uring (introduced in Linux 5.1, stabilized in 5.10) provides two ring buffers shared between user space and kernel space:
- Submission Queue (SQ): user space writes I/O requests (read page X, write page Y) into this ring
- Completion Queue (CQ): kernel writes completion results into this ring when I/O finishes
The critical property: submission is cheap and batched. Postgres fills in any number of entries in the SQ ring and then hands them to the kernel with a single io_uring_enter() system call (or with none at all when the ring is set up for kernel-side polling). This amortizes the per-request syscall overhead that dominated async I/O approaches before io_uring.
```c
/* Postgres process (fragment: assumes liburing and an initialized `ring`) */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  /* get submission entry */
io_uring_prep_read(sqe, fd, buf, len, offset);        /* fill in read request */
io_uring_submit(&ring);                               /* submit batch (one syscall per batch) */

/* ... do other work ... */

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);                       /* wait for completion (or poll) */
int result = cqe->res;                                /* bytes read, or a negative errno */
io_uring_cqe_seen(&ring, cqe);                        /* mark completion consumed */
```
PG18's io_uring backend wraps this interface, allowing backends and background workers to submit batches of page reads/writes without blocking.
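The two-ring flow can be modeled in a few lines to make the batching visible. This is a toy model only: ToyRing and its methods are invented for illustration and are not the liburing API, and the simulated kernel completes every read instantly and in full.

```python
from collections import deque

PAGE_SIZE = 8192


class ToyRing:
    """Toy model of io_uring's two queues; not the real liburing API."""

    def __init__(self, entries: int):
        self.entries = entries
        self.sq = deque()      # submission queue: requests written by user space
        self.cq = deque()      # completion queue: results written by the "kernel"
        self.enter_calls = 0   # number of simulated system calls

    def prep_read(self, page_no: int) -> None:
        """Queue a read request; no simulated syscall happens here."""
        if len(self.sq) >= self.entries:
            raise BufferError("submission queue is full")
        self.sq.append(page_no)

    def submit(self) -> int:
        """Hand the whole batch to the 'kernel' with one simulated syscall."""
        self.enter_calls += 1
        submitted = len(self.sq)
        while self.sq:
            page_no = self.sq.popleft()
            self.cq.append((page_no, PAGE_SIZE))  # pretend the read completed in full
        return submitted

    def reap(self) -> list:
        """Drain all available completions, like consuming CQEs."""
        done = list(self.cq)
        self.cq.clear()
        return done


ring = ToyRing(entries=8)
for page in range(4):
    ring.prep_read(page)

print(ring.submit())     # 4 requests handed over
print(ring.enter_calls)  # 1 simulated syscall for the whole batch
print(ring.reap())       # [(0, 8192), (1, 8192), (2, 8192), (3, 8192)]
```

The point of the model is the ratio: four reads, one trip into the kernel. A blocking pread() loop would make four.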
The Kernel Version Requirement
io_uring has had security issues in older kernels. A reasonable rule of thumb when choosing a kernel:
- Minimum: Linux 5.1 (io_uring first available)
- Recommended: Linux 5.19+ (significant io_uring stability improvements)
- Production: Linux 6.x LTS (most mature io_uring implementation)
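On kernels below 5.1, or with a Postgres build that lacks liburing support, io_method = io_uring cannot be used. io_method = worker, which is the default on every platform, is the option to use there.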
```sql
-- Check which io_method is active
SHOW io_method;

-- version() reports the PostgreSQL version and build platform, not the running
-- kernel; check the kernel itself with `uname -r` on the host
SELECT current_setting('server_version'), version();
```
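The kernel guidance above can be turned into a small pre-flight check for a deployment script. This is a sketch under stated assumptions: suggest_io_method and kernel_tuple are hypothetical helpers, the 5.1 threshold is the minimum named in this section, and whether the server was built with liburing has to be supplied by the caller because this code cannot detect it.

```python
import platform
import re


def kernel_tuple(release: str) -> tuple:
    """Parse '6.8.0-45-generic' into (6, 8)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def suggest_io_method(system: str, kernel_release: str, has_liburing: bool) -> str:
    """Suggest io_uring only on Linux >= 5.1 with liburing support, else worker."""
    if system == "Linux" and has_liburing and kernel_tuple(kernel_release) >= (5, 1):
        return "io_uring"
    return "worker"


# On the current host (has_liburing is an assumption the caller must verify):
print(suggest_io_method(platform.system(), platform.release(), has_liburing=False))
```

A stricter policy, such as requiring a 6.x kernel for production, only needs a different threshold tuple.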
Interaction with the OS Page Cache
An important nuance: async I/O in PG18 works through the OS page cache (buffered I/O), not direct I/O (O_DIRECT). This means:
- Postgres submits an async read for a page
- The OS checks its page cache — cache hit: returns immediately
- Cache miss: OS schedules actual disk I/O, delivers page when complete
The performance benefit of async I/O is therefore most visible when the OS page cache is cold (data not in OS cache). When the OS page cache is warm (recently accessed data), async I/O adds minimal benefit because cache hits return immediately regardless.
For dedicated Postgres servers where the OS page cache effectively extends shared_buffers, this means:
- Frequently-accessed data: minimal async I/O benefit (already in OS cache)
- Infrequently-accessed data / full table scans: significant benefit (cache misses overlap with processing)
Future versions of Postgres may implement O_DIRECT support to bypass the OS cache entirely, giving Postgres direct control over I/O scheduling — but this requires additional complexity in the buffer management layer.
Summary
| Aspect | Before PG18 | After PG18 |
|---|---|---|
| Sequential scan I/O | One page at a time, blocking | Prefetch-ahead, non-blocking, hardware-parallel |
| Checkpoint writes | Sequential, throttled | Unchanged in PG18 (async writes not yet implemented) |
| Autovacuum heap scan | Blocking per-page reads | Prefetch-ahead reads |
| WAL writer | Synchronous writes | Unchanged in PG18 |
| I/O visibility | pg_stat_bgwriter, then per-subsystem pg_stat_io from PG16 | pg_stat_io, plus the new pg_aios view of in-flight I/O |
| MVCC semantics | Unchanged | Unchanged |
| Durability guarantees | Unchanged | Unchanged |
| Connection model | Multi-process | Unchanged |
| Config key | effective_io_concurrency | io_method, io_max_concurrency, effective_io_concurrency |
The single most important takeaway: async I/O is a throughput improvement for I/O-bound workloads. If your workload is CPU-bound (query planning, hash join computation, sort operations) or cache-bound (working set fits in shared_buffers), the impact will be minimal. Profile with pg_stat_io to understand where your time actually goes before expecting dramatic gains.
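Amdahl's Law gives a quick way to set expectations before benchmarking. The sketch below is a simplification: overall_speedup is a hypothetical helper, io_fraction is the share of query time spent waiting on reads (which you would estimate from pg_stat_io or wait events), and the 4x I/O speedup is an example value, not a measured result.

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Whole-workload speedup when only the I/O-wait share gets faster."""
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# I/O-bound workload: 80% of the time is read wait, reads become 4x faster
print(f"{overall_speedup(0.8, 4.0):.2f}x")  # 2.50x

# CPU-bound workload: only 10% of the time is read wait
print(f"{overall_speedup(0.1, 4.0):.2f}x")  # 1.08x
```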
Module 9 covers the other major PG18 addition: OLD and NEW aliases in the RETURNING clause — a small syntactic change that eliminates entire classes of application-level race conditions.
Next: Module 9 — The RETURNING Clause Evolved: OLD/NEW Aliases and Eliminating Race Conditions →