Module A-4 · 22 min read

The Redlock algorithm step by step across N independent Redis instances, what it guarantees under bounded clock drift, what Martin Kleppmann's critique gets right, fencing tokens as the correct complement, and implementation with redlock-node.

A-4 — Redlock: The Algorithm, Its Guarantees, and Its Critics

Who this module is for: You use a single Redis instance for distributed locking but want the lock to survive a Redis node failure. Redlock is Redis's multi-instance distributed lock algorithm. This module covers the algorithm step-by-step, what it actually guarantees, Martin Kleppmann's critique (the most important distributed systems analysis of Redis locking), and when Redlock is and is not the right tool.


The Problem with Single-Instance Locking

A single-instance Redis lock has one critical failure mode. Redis replication is asynchronous, so if the primary fails after granting a lock but before the SET NX has replicated, the replica promoted by failover has no record of the lock. The new primary then grants the same lock to a second client while the first still believes it holds it. Two clients now hold the same lock simultaneously.

Redlock addresses this by requiring a lock to be acquired on a majority of independent Redis instances.
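The failover race described above can be reproduced with a small in-memory model. This is only a sketch: `FakeRedisNode`, `replicateTo`, and `simulateFailover` are hypothetical names that imitate SET NX and one round of asynchronous replication, and no real Redis client is involved.

```typescript
// In-memory stand-in for one Redis node (hypothetical, not a real client).
class FakeRedisNode {
  private keys: Record<string, string> = {};

  // Mimics SET key value NX: succeeds only if the key is absent.
  setNx(key: string, value: string): boolean {
    if (this.keys[key] !== undefined) return false;
    this.keys[key] = value;
    return true;
  }

  // Models one round of replication: copy every key to the replica.
  replicateTo(replica: FakeRedisNode): void {
    for (const k of Object.keys(this.keys)) {
      const v = this.keys[k];
      if (v !== undefined) replica.keys[k] = v;
    }
  }
}

// Returns the clients that believe they hold the lock after a failover.
function simulateFailover(replicatedBeforeCrash: boolean): string[] {
  const primary = new FakeRedisNode();
  const replica = new FakeRedisNode();
  const holders: string[] = [];

  if (primary.setNx('lock:resource', 'client-1')) holders.push('client-1');
  if (replicatedBeforeCrash) primary.replicateTo(replica);

  // The primary crashes here; the replica is promoted and serves new requests.
  const newPrimary = replica;
  if (newPrimary.setNx('lock:resource', 'client-2')) holders.push('client-2');
  return holders;
}

console.log(simulateFailover(true));  // only client-1 holds the lock
console.log(simulateFailover(false)); // client-1 and client-2 both hold it
```

The only difference between the two runs is whether the write reached the replica before the crash, which is exactly the window that asynchronous replication leaves open.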


The Redlock Algorithm

Redlock runs against N independent Redis instances, typically 5. An odd N is the usual choice because an even count adds cost without tolerating any more failures. "Independent" means separate machines with no replication between them. A failure of any minority (< N/2) of instances does not affect lock correctness.

Step-by-Step

N = 5 independent Redis instances
Validity time = 30 seconds (desired lock TTL)

1. Record the start time: T1 = current time in milliseconds

2. For each of the N instances, attempt to acquire the lock:
   SET lock:resource {uuid} NX PX {validity_ms}
   Use a small per-instance timeout (e.g., 5–50ms) to avoid blocking if an instance is down.

3. Count successful acquisitions. Calculate elapsed time: T2 = current time
   Elapsed = T2 - T1
   
4. The lock is acquired if:
   a. Acquired on at least floor(N/2) + 1 = 3 of 5 instances (quorum)
   b. Remaining validity time > 0: (validity_ms - elapsed - clock_drift) > 0

5. If the lock is acquired:
   The "actual TTL" = validity_ms - elapsed - clock_drift
   where clock_drift is a small allowance (for example, 1% of validity_ms).
   The holder can use the resource for at most this remaining time.

6. If the lock was NOT acquired (fewer than quorum, or time expired):
   Release the lock on all instances where it was acquired (to clean up).
   Wait a random backoff time before retrying.
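The steps above can be sketched in TypeScript. To keep the example runnable without a Redis deployment, it uses synchronous in-memory stand-ins: `LockInstance`, `InMemoryInstance`, and `acquireRedlock` are hypothetical names, key expiry is not modeled, and a real client would make asynchronous network calls with per-instance timeouts. The sketch treats a drift-adjusted validity of zero or less as a failed acquisition.

```typescript
// Hypothetical stand-in for one Redis instance. A real client would issue
// SET key token NX PX ttl and a compare-and-delete script over the network.
interface LockInstance {
  tryAcquire(key: string, token: string, ttlMs: number): boolean;
  release(key: string, token: string): void;
}

class InMemoryInstance implements LockInstance {
  private owner: Record<string, string> = {};
  private up: boolean;

  constructor(up: boolean) {
    this.up = up; // false simulates an unreachable instance
  }

  tryAcquire(key: string, token: string, _ttlMs: number): boolean {
    if (!this.up || this.owner[key] !== undefined) return false;
    this.owner[key] = token; // expiry is not modeled in this sketch
    return true;
  }

  release(key: string, token: string): void {
    // Only the owner's token may delete the key.
    if (this.owner[key] === token) delete this.owner[key];
  }
}

interface AcquireResult { acquired: boolean; validityMs: number; votes: number }

function acquireRedlock(
  instances: LockInstance[], key: string, token: string,
  ttlMs: number, driftMs: number, now: () => number,
): AcquireResult {
  const start = now();                                          // step 1
  const votes = instances
    .filter((i) => i.tryAcquire(key, token, ttlMs)).length;     // step 2
  const elapsed = now() - start;                                // step 3
  const quorum = Math.floor(instances.length / 2) + 1;
  const validityMs = ttlMs - elapsed - driftMs;                 // step 5
  if (votes >= quorum && validityMs > 0) {                      // step 4
    return { acquired: true, validityMs, votes };
  }
  // Step 6: release everywhere; releasing where nothing was set is a no-op.
  for (const i of instances) i.release(key, token);
  return { acquired: false, validityMs: 0, votes };
}

// Five instances, two of them down. The fake clock advances 10 ms per reading.
const nodes = [true, true, true, false, false].map((up) => new InMemoryInstance(up));
let fakeNow = 0;
const clock = () => (fakeNow += 10);

const first = acquireRedlock(nodes, 'lock:resource', 'uuid-1', 30000, 302, clock);
const second = acquireRedlock(nodes, 'lock:resource', 'uuid-2', 30000, 302, clock);
console.log(first);  // acquired with 3 votes, validity 29688 ms
console.log(second); // not acquired: 0 votes while uuid-1 holds the quorum
```

Releasing on every instance rather than only the ones that reported success is deliberate: a SET may have succeeded on an instance whose reply was lost.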

Why Quorum?

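If 3 of 5 instances grant the lock to one client, no other client can reach quorum at the same time. At best, a competitor acquires the 2 instances that did not grant the lock to the first client. 2 < 3, so the competitor cannot acquire the lock. Any two majorities share at least one instance, and that instance grants the key to only one client at a time.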
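The quorum argument is simple arithmetic, shown here as a short sketch. The helper names `quorumSize` and `minOverlap` are illustrative only.

```typescript
// Majority quorum for n instances.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// Smallest possible overlap between two quorums drawn from n instances.
// If this is at least 1, two clients can never both hold a quorum at once.
function minOverlap(n: number): number {
  return 2 * quorumSize(n) - n;
}

for (const n of [3, 5, 7]) {
  console.log(`N=${n}: quorum=${quorumSize(n)}, min overlap=${minOverlap(n)}`);
}
// N=3: quorum=2, min overlap=1
// N=5: quorum=3, min overlap=1
// N=7: quorum=4, min overlap=1
```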

Why Deduct Elapsed Time?

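Each instance is given the full validity_ms as its TTL, and that TTL starts counting down the moment the SET succeeds on that instance. By the time the client has contacted every instance, the earliest key has already used up part of its TTL. The safe window for holding the lock is therefore validity_ms minus the elapsed acquisition time, minus a small allowance for clock drift between machines.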
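A worked example of the deduction, as a sketch. The drift allowance here is computed as a fraction of the TTL plus 2 ms; the extra 2 ms is an assumption borrowed from common implementations to cover Redis expiry granularity, and `remainingValidityMs` is an illustrative name.

```typescript
function remainingValidityMs(ttlMs: number, elapsedMs: number, driftFactor: number): number {
  // Assumed drift allowance: a fraction of the TTL plus 2 ms.
  const driftMs = Math.round(driftFactor * ttlMs) + 2;
  return ttlMs - elapsedMs - driftMs;
}

// Fast acquisition: 30 s TTL, 120 ms spent contacting instances, 1% drift factor.
console.log(remainingValidityMs(30000, 120, 0.01)); // 29578

// Slow acquisition: 10 s TTL, 9950 ms elapsed. The result is negative,
// so the lock must be treated as not acquired and released everywhere.
console.log(remainingValidityMs(10000, 9950, 0.01)); // -52
```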


Implementation: redlock-node

bash
npm install redlock ioredis
typescript
// Assumes the redlock v5 API (using(), automaticExtensionThreshold, ExecutionError export)
import Redlock, { ExecutionError } from 'redlock';
import Redis from 'ioredis';

// 5 independent Redis instances (no replication between them)
const clients = [
  new Redis({ host: 'redis-1.internal', port: 6379 }),
  new Redis({ host: 'redis-2.internal', port: 6379 }),
  new Redis({ host: 'redis-3.internal', port: 6379 }),
  new Redis({ host: 'redis-4.internal', port: 6379 }),
  new Redis({ host: 'redis-5.internal', port: 6379 }),
];

const redlock = new Redlock(clients, {
  driftFactor: 0.01, // clock drift allowance: 1% of the lock TTL
  retryCount: 3, // retry attempts
  retryDelay: 200, // ms between retries
  retryJitter: 100, // max random jitter (ms) added to each retry delay
  automaticExtensionThreshold: 500, // using() extends the lock when < 500ms remain
});

// Acquire lock for 10 seconds.
// executePaymentLogic and doWork are application functions defined elsewhere.
async function processPayment(paymentId: string) {
  let lock;
  try {
    lock = await redlock.acquire([`lock:payment:${paymentId}`], 10000);
    // Lock held: run the critical section
    await executePaymentLogic(paymentId);
  } catch (err) {
    // ExecutionError is thrown when a quorum could not be reached
    if (err instanceof ExecutionError) {
      throw new Error('Could not acquire distributed lock: payment already in progress');
    }
    throw err;
  } finally {
    if (lock) {
      await lock.release();
    }
  }
}

// Using the lock extension (watchdog pattern)
async function processLongJob(jobId: string) {
  await redlock.using(
    [`lock:job:${jobId}`],
    30000, // initial TTL
    async (signal) => {
      // redlock.using() extends the lock automatically once the remaining
      // time drops below automaticExtensionThreshold.
      // signal.aborted === true if the lock could not be extended
      await doWork(jobId, signal);
    }
  );
}
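The module does not show `doWork`, so the following is only one plausible shape for a routine that respects the abort signal: split the job into steps and check `signal.aborted` before starting each one. `LockSignal` is a minimal assumed interface covering just the fields used here, and `runStepsUnderLock` is a hypothetical helper, not part of redlock.

```typescript
// Minimal assumed shape of the signal handed to the using() routine.
interface LockSignal {
  aborted: boolean;
  error?: Error;
}

// Runs steps in order and stops as soon as the lock is reported lost.
// Returns the number of steps completed.
async function runStepsUnderLock(
  jobId: string,
  signal: LockSignal,
  steps: Array<() => Promise<void>>,
): Promise<number> {
  let completed = 0;
  for (const step of steps) {
    if (signal.aborted) {
      // The lock could not be extended: another client may now hold it.
      throw new Error(`Lock lost for job ${jobId} after ${completed} step(s)`);
    }
    await step();
    completed++;
  }
  return completed;
}
```

The check only narrows the window. A step that is already running when the lock is lost still completes, so each step should be short or safe to repeat.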

What Redlock Guarantees

Under the assumptions the algorithm makes, Redlock provides:

Safety (mutual exclusion): At most one client holds the lock at any given time, provided:

  • Clock drift across instances is bounded and small relative to the lock TTL
  • Network delays are bounded
  • Only a minority (< N/2) of instances fail simultaneously

Liveness (progress): The lock will eventually be released — either by the holder explicitly, or by TTL expiry on all instances.

Fault tolerance: The algorithm remains correct when up to (N-1)/2 instances fail. With 5 instances, 2 can fail and the algorithm still works.
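The fault-tolerance figure follows from the quorum size. A quick sketch (the helper name `toleratedFailures` is illustrative) also shows why an even instance count buys nothing extra:

```typescript
// Instances that can be down while a majority quorum is still reachable.
function toleratedFailures(n: number): number {
  const quorum = Math.floor(n / 2) + 1;
  return n - quorum;
}

for (const n of [3, 4, 5, 6]) {
  console.log(`N=${n}: tolerates ${toleratedFailures(n)} failure(s)`);
}
// N=3: tolerates 1 failure(s)
// N=4: tolerates 1 failure(s)
// N=5: tolerates 2 failure(s)
// N=6: tolerates 2 failure(s)
```

For odd N the result equals (N-1)/2, matching the figure above.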


Martin Kleppmann's Critique

In 2016, Martin Kleppmann published "How to do distributed locking" (http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html), arguing that Redlock's safety guarantees are insufficient for strong mutual exclusion. Antirez (Salvatore Sanfilippo, Redis author) responded. The debate is worth understanding.

The Core Argument

Kleppmann's claim: Redlock assumes bounded clock drift, but real systems violate this assumption:

  1. Clock jumps — NTP adjustments, VM live migration, and operator-initiated clock corrections can jump system time forward or backward by arbitrary amounts.

  2. GC pauses — A process paused by GC for 30+ seconds continues executing after the pause without knowing time passed. It believes it holds a valid lock; the lock has expired and been granted to another client.

  3. Network delays — A message delayed in a network buffer for longer than the lock TTL can arrive "late," causing the receiver to act on stale lock state.

The failure scenario:

T=0: Client 1 acquires Redlock with TTL=10s on 3/5 instances
T=1: Client 1 pauses (GC stop-the-world for 15 seconds)
T=11: All instances expire Client 1's lock
T=12: Client 2 acquires Redlock on 3/5 instances
T=12: Client 2 begins writing to the database
T=16: Client 1 resumes from GC pause — believes it holds the lock
T=16: Client 1 ALSO begins writing to the database
      → Two clients concurrently writing to a shared resource

This scenario can occur even with a correctly implemented Redlock on correctly functioning Redis instances.

Kleppmann's conclusion: If you need strong mutual exclusion (the scenario above must never happen), Redlock is insufficient. You need a system with fencing tokens — a monotonically increasing counter that the resource uses to reject stale operations.
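A minimal sketch of the fencing check on the resource side. `FencedStore` is a hypothetical in-memory resource, and the token values are made up for illustration. Redlock itself does not issue such tokens, so the sketch assumes they come from a lock service that hands out a strictly increasing number with every grant.

```typescript
// A resource that remembers the highest fencing token it has seen
// and rejects writes carrying an older one.
class FencedStore {
  private highestToken = 0;
  private value = '';

  write(token: number, value: string): boolean {
    if (token < this.highestToken) return false; // stale holder: reject
    this.highestToken = token;
    this.value = value;
    return true;
  }

  read(): string {
    return this.value;
  }
}

const store = new FencedStore();
const client1Token = 33; // issued with Client 1's lock, before its GC pause
const client2Token = 34; // issued to Client 2 after Client 1's lock expired

const accepted = store.write(client2Token, 'from client 2');
// Client 1 resumes from its pause and writes with its old token.
const late = store.write(client1Token, 'from client 1');

console.log(accepted, late, store.read()); // true false from client 2
```

The important design point is that the resource enforces the ordering, not the client. A paused client cannot know its lock expired, but the store can tell that token 33 is older than token 34.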

Antirez's Response

Antirez agreed that process pauses longer than the lock TTL can break Redlock's safety. His counterargument: this is a problem with all distributed locks, not specific to Redlock. He argued:

  • For most practical use cases (job deduplication, cache stampede prevention), the consequences of occasional double execution are acceptable
  • If you need fencing tokens, you need a system like ZooKeeper that provides them — Redis was not designed for this use case
  • The bounded clock drift assumption is reasonable for systems where NTP is properly configured and not subject to arbitrary jumps

The Honest Synthesis

Both are correct. The question is: what do you need?

| Need | Solution |
| --- | --- |
| Best-effort mutual exclusion (double execution unlikely, not catastrophic) | Single-instance Redis lock or Redlock |
| Strong mutual exclusion (double execution must never happen) | ZooKeeper or etcd (provide fencing tokens) |
| High throughput lock with tolerable false positives | Redlock with short TTLs |
| Financial/ledger operations | Database row locks + idempotency keys, not Redis |

When to Use Redlock

Appropriate:

  • Preventing duplicate processing of jobs when single Redis failure is a concern
  • Cache stampede prevention across multiple Redis instances
  • Leader election where brief periods of no leader are acceptable
  • Coordination where the worst case (two holders) causes temporary inconsistency, not permanent data corruption

Not appropriate:

  • Any operation where two concurrent holders cause irrecoverable data corruption
  • Financial transactions, inventory deductions, ledger writes
  • Systems where you cannot implement idempotency to recover from double execution
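Idempotency is what makes an occasional double execution survivable. A minimal in-memory sketch follows; `IdempotentExecutor` is a hypothetical helper. In a real system the record of completed keys would live in the database, written in the same transaction as the effect and protected by a unique constraint, so that it also holds across processes.

```typescript
// Runs an operation at most once per idempotency key and replays the
// stored result on any repeat call.
class IdempotentExecutor<T> {
  private results: Record<string, { value: T }> = {};

  run(key: string, operation: () => T): T {
    const existing = this.results[key];
    if (existing !== undefined) return existing.value; // already done: replay
    const value = operation();
    this.results[key] = { value };
    return value;
  }
}

let charges = 0;
const executor = new IdempotentExecutor<string>();
const chargeOnce = () =>
  executor.run('payment:42', () => {
    charges++;
    return 'charged';
  });

chargeOnce();
chargeOnce(); // a second holder repeating the work has no additional effect
console.log(charges); // 1
```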

Practical guidance for most applications: A single Redis instance with SET NX PX is sufficient for the large majority of distributed locking use cases. The scenario where Redlock adds meaningful safety (the Redis primary fails while a lock is held, before the lock has replicated) is rare in practice. Where that scenario does matter, evaluate whether ZooKeeper/etcd or database row locks are more appropriate than Redlock.


Redlock Configuration for Production

typescript
const redlock = new Redlock(clients, {
  driftFactor: 0.01, // 1% of TTL as drift allowance
  retryCount: 5, // try 5 times
  retryDelay: 200, // wait 200ms between retries
  retryJitter: 100, // add up to 100ms random jitter
  automaticExtensionThreshold: 1000, // extend when < 1s remaining
});

// Log errors reported by individual Redis clients during lock operations
// (redlock v5 emits 'error'; v4 used 'clientError')
redlock.on('error', (err) => {
  console.error('A redis client errored during lock operation:', err);
});

// Set the lock TTL generously: acquisition time and drift are deducted from it
// For a job that takes up to 5s: set TTL to 30s (6x margin)
const lock = await redlock.acquire(['lock:job:123'], 30000);

Summary

  • Redlock acquires a lock on a quorum (floor(N/2) + 1) of independent Redis instances, so it survives minority failures
  • The algorithm: attempt acquisition on every instance with a small per-instance timeout, check quorum and remaining validity, and use the lock only if both conditions are met
  • Guarantees under bounded clock drift and bounded network delay: safety (one holder) + liveness (lock releases)
  • Kleppmann's critique: GC pauses, clock jumps, and network delays can exceed the lock TTL, causing two clients to simultaneously believe they hold the lock — Redlock cannot prevent this
  • Fencing tokens (not provided by Redis) are the correct solution for true mutual exclusion
  • Use Redlock for best-effort coordination where double execution is rare and survivable
  • Use ZooKeeper/etcd for strong mutual exclusion; use database row locks for financial operations

Next: A-5 — Reentrant Locks, Hierarchies, and Deadlock Prevention — advanced locking patterns including reentrant locks via Hash-stored reentry counters and consistent lock ordering to prevent circular waits.

© 2026 Jatin Jain Saraf (JJS). All rights reserved.