The Redlock algorithm step by step across N independent Redis instances, what it guarantees under bounded clock drift, what Martin Kleppmann's critique gets right, fencing tokens as the correct complement, and implementation with redlock-node.
A-4 — Redlock: The Algorithm, Its Guarantees, and Its Critics
Who this module is for: You use a single Redis instance for distributed locking but want the lock to survive a Redis node failure. Redlock is Redis's multi-instance distributed lock algorithm. This module covers the algorithm step-by-step, what it actually guarantees, Martin Kleppmann's critique (the most important distributed systems analysis of Redis locking), and when Redlock is and is not the right tool.
The Problem with Single-Instance Locking
A single-instance Redis lock has one critical failure mode: if the Redis master fails after granting a lock but before the lock holder releases it, and a failover promotes a replica, the replica does not have the lock (replication is asynchronous — the SET NX may not have replicated before the primary failed). The new primary issues the lock to a new client. Now two clients hold the same lock simultaneously.
Redlock addresses this by requiring a lock to be acquired on a majority of independent Redis instances.
The Redlock Algorithm
Redlock requires an odd number of independent Redis instances — typically 5. "Independent" means separate machines with no replication between them. A failure of any minority (< N/2) of instances does not affect lock correctness.
Step-by-Step
N = 5 independent Redis instances
Validity time = 30 seconds (desired lock TTL)
1. Record the start time: T1 = current time in milliseconds
2. For each of the N instances, attempt to acquire the lock:
SET lock:resource {uuid} NX PX {validity_ms}
Use a small per-instance timeout (e.g., 5–50ms) to avoid blocking if an instance is down.
3. Count successful acquisitions. Calculate elapsed time: T2 = current time
Elapsed = T2 - T1
4. The lock is acquired if:
a. Acquired on at least (N/2 + 1) = 3 of 5 instances (quorum)
b. Remaining validity time > 0: (validity_ms - elapsed) > 0
5. If the lock is acquired:
The "actual TTL" = validity_ms - elapsed - clock_drift_factor
The holder can use the resource for at most this remaining time.
6. If the lock was NOT acquired (fewer than quorum, or time expired):
Release the lock on all instances where it was acquired (to clean up).
Wait a random backoff time before retrying.
Why Quorum?
If 3 of 5 instances grant the lock, no other client can simultaneously get quorum — at best, they see 2 instances (the 2 that failed or are unreachable). 2 < 3, so they cannot acquire the lock. The majority ensures safety.
Why Deduct Elapsed Time?
The lock is set on each instance with the full validity_ms. But by the time the algorithm finishes acquiring on all instances, some time has passed. The actual safe window for holding the lock is reduced by this elapsed time plus a small clock drift factor.
Implementation: redlock-node
bashnpm install redlock
typescriptimport Redlock from 'redlock'; import Redis from 'ioredis'; // 5 independent Redis instances const clients = [ new Redis({ host: 'redis-1.internal', port: 6379 }), new Redis({ host: 'redis-2.internal', port: 6379 }), new Redis({ host: 'redis-3.internal', port: 6379 }), new Redis({ host: 'redis-4.internal', port: 6379 }), new Redis({ host: 'redis-5.internal', port: 6379 }), ]; const redlock = new Redlock(clients, { driftFactor: 0.01, // 1% clock drift factor retryCount: 3, // retry attempts retryDelay: 200, // ms between retries retryJitter: 100, // random jitter on retry automaticExtensionThreshold: 500, // extend lock when < 500ms remaining }); // Acquire lock for 10 seconds async function processPayment(paymentId: string) { let lock; try { lock = await redlock.acquire([`lock:payment:${paymentId}`], 10000); // Lock held — safe to execute await executePaymentLogic(paymentId); } catch (err) { if (err instanceof Redlock.ExecutionError) { throw new Error('Could not acquire distributed lock — payment already in progress'); } throw err; } finally { if (lock) { await lock.release(); } } } // Using the lock extension (watchdog pattern) async function processLongJob(jobId: string) { await redlock.using( [`lock:job:${jobId}`], 30000, // initial TTL async (signal) => { // redlock.using() automatically extends the lock every ~15s // signal.aborted === true if the lock could not be extended await doWork(jobId, signal); } ); }
What Redlock Guarantees
Under the assumptions the algorithm makes, Redlock provides:
Safety (mutual exclusion): At most one client holds the lock at any given time, provided:
- Clock drift across instances is bounded and small relative to the lock TTL
- Network delays are bounded
- No more than minority (< N/2) of instances fail simultaneously
Liveness (progress): The lock will eventually be released — either by the holder explicitly, or by TTL expiry on all instances.
Fault tolerance: The algorithm remains correct when up to (N-1)/2 instances fail. With 5 instances, 2 can fail and the algorithm still works.
Martin Kleppmann's Critique
In 2016, Martin Kleppmann published "How to do distributed locking" (http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html), arguing that Redlock's safety guarantees are insufficient for strong mutual exclusion. Antirez (Salvatore Sanfilippo, Redis author) responded. The debate is worth understanding.
The Core Argument
Kleppmann's claim: Redlock assumes bounded clock drift, but real systems violate this assumption:
-
Clock jumps — NTP adjustments, VM live migration, and operator-initiated clock corrections can jump system time forward or backward by arbitrary amounts.
-
GC pauses — A process paused by GC for 30+ seconds continues executing after the pause without knowing time passed. It believes it holds a valid lock; the lock has expired and been granted to another client.
-
Network delays — A message delayed in a network buffer for longer than the lock TTL can arrive "late," causing the receiver to act on stale lock state.
The failure scenario:
T=0: Client 1 acquires Redlock with TTL=10s on 3/5 instances
T=1: Client 1 pauses (GC stop-the-world for 15 seconds)
T=11: All instances expire Client 1's lock
T=12: Client 2 acquires Redlock on 3/5 instances
T=12: Client 2 begins writing to the database
T=16: Client 1 resumes from GC pause — believes it holds the lock
T=16: Client 1 ALSO begins writing to the database
→ Two clients concurrently writing to a shared resource
This scenario happens even with a correctly implemented Redlock on correctly functioning Redis instances.
Kleppmann's conclusion: If you need strong mutual exclusion (the scenario above must never happen), Redlock is insufficient. You need a system with fencing tokens — a monotonically increasing counter that the resource uses to reject stale operations.
Antirez's Response
Antirez agreed that process pauses longer than the lock TTL can break Redlock's safety. His counterargument: this is a problem with all distributed locks, not specific to Redlock. He argued:
- For most practical use cases (job deduplication, cache stampede prevention), the consequences of occasional double execution are acceptable
- If you need fencing tokens, you need a system like ZooKeeper that provides them — Redis was not designed for this use case
- The bounded clock drift assumption is reasonable for systems where NTP is properly configured and not subject to arbitrary jumps
The Honest Synthesis
Both are correct. The question is: what do you need?
| Need | Solution |
|---|---|
| Best-effort mutual exclusion (double execution unlikely, not catastrophic) | Single-instance Redis lock or Redlock |
| Strong mutual exclusion (double execution must never happen) | ZooKeeper or etcd (provide fencing tokens) |
| High throughput lock with tolerable false positives | Redlock with short TTLs |
| Financial/ledger operations | Database row locks + idempotency keys — not Redis |
When to Use Redlock
Appropriate:
- Preventing duplicate processing of jobs when single Redis failure is a concern
- Cache stampede prevention across multiple Redis instances
- Leader election where brief periods of no leader are acceptable
- Coordination where the worst case (two holders) causes temporary inconsistency, not permanent data corruption
Not appropriate:
- Any operation where two concurrent holders cause irrecoverable data corruption
- Financial transactions, inventory deductions, ledger writes
- Systems where you cannot implement idempotency to recover from double execution
Practical guidance for most applications: A single Redis instance with SET NX PX is sufficient for 95% of distributed locking use cases. The scenarios where Redlock adds meaningful safety (Redis master fails exactly while holding a lock, before replication completes) are rare in practice. For the 5% of cases where this matters, evaluate whether ZooKeeper/etcd or database row locks are more appropriate than Redlock.
Redlock Configuration for Production
typescriptconst redlock = new Redlock(clients, { driftFactor: 0.01, // 1% of TTL as drift allowance retryCount: 5, // try 5 times retryDelay: 200, // wait 200ms between retries retryJitter: 100, // add up to 100ms random jitter automaticExtensionThreshold: 1000, // extend when < 1s remaining }); // Listen for lock extension failures redlock.on('clientError', (err) => { console.error('A redis client errored during lock operation:', err); }); // Set lock TTL generously — deduct elapsed time in the algorithm // For a job that takes up to 5s: set TTL to 30s (6x margin) const lock = await redlock.acquire(['lock:job:123'], 30000);
Summary
- Redlock acquires a lock on a quorum (N/2 + 1) of independent Redis instances — survives minority failures
- The algorithm: acquire on all instances concurrently with small per-instance timeout, check quorum + remaining validity, use if both conditions met
- Guarantees under bounded clock drift and bounded network delay: safety (one holder) + liveness (lock releases)
- Kleppmann's critique: GC pauses, clock jumps, and network delays can exceed the lock TTL, causing two clients to simultaneously believe they hold the lock — Redlock cannot prevent this
- Fencing tokens (not provided by Redis) are the correct solution for true mutual exclusion
- Use Redlock for best-effort coordination where double execution is rare and survivable
- Use ZooKeeper/etcd for strong mutual exclusion; use database row locks for financial operations
Next: A-5 — Reentrant Locks, Hierarchies, and Deadlock Prevention — advanced locking patterns including reentrant locks via Hash-stored reentry counters and consistent lock ordering to prevent circular waits.