Module A-4·22 min read

The Redlock algorithm step by step across N independent Redis instances, what it guarantees under bounded clock drift, what Martin Kleppmann's critique gets right, fencing tokens as the correct complement, and implementation with redlock-node.

A-4 — Redlock: The Algorithm, Its Guarantees, and Its Critics

Who this module is for: You use a single Redis instance for distributed locking but want the lock to survive a Redis node failure. Redlock is Redis's multi-instance distributed lock algorithm. This module covers the algorithm step-by-step, what it actually guarantees, Martin Kleppmann's critique (the most important distributed systems analysis of Redis locking), and when Redlock is and is not the right tool.

The Problem with Single-Instance Locking

A single-instance Redis lock has one critical failure mode: if the Redis master fails after granting a lock but before the lock holder releases it, and a failover promotes a replica, the replica does not have the lock (replication is asynchronous — the SET NX may not have replicated before the primary failed). The new primary issues the lock to a new client. Now two clients hold the same lock simultaneously.

Redlock addresses this by requiring a lock to be acquired on a majority of independent Redis instances.

The Redlock Algorithm

Redlock requires an odd number of independent Redis instances — typically 5. "Independent" means separate machines with no replication between them. A failure of any minority (< N/2) of instances does not affect lock correctness.

Step-by-Step

N = 5 independent Redis instances
Validity time = 30 seconds (desired lock TTL)

1. Record the start time: T1 = current time in milliseconds

2. For each of the N instances, attempt to acquire the lock:
   SET lock:resource {uuid} NX PX {validity_ms}
   Use a small per-instance timeout (e.g., 5–50ms) to avoid blocking if an instance is down.

3. Count successful acquisitions. Calculate elapsed time: T2 = current time
   Elapsed = T2 - T1
   
4. The lock is acquired if:
   a. Acquired on at least (N/2 + 1) = 3 of 5 instances (quorum)
   b. Remaining validity time > 0: (validity_ms - elapsed) > 0

5. If the lock is acquired:
   The "actual TTL" = validity_ms - elapsed - clock_drift_factor
   The holder can use the resource for at most this remaining time.

6. If the lock was NOT acquired (fewer than quorum, or time expired):
   Release the lock on all instances where it was acquired (to clean up).
   Wait a random backoff time before retrying.

Why Quorum?

If 3 of 5 instances grant the lock, no other client can simultaneously get quorum — at best, they see 2 instances (the 2 that failed or are unreachable). 2 < 3, so they cannot acquire the lock. The majority ensures safety.

Why Deduct Elapsed Time?

The lock is set on each instance with the full validity_ms. But by the time the algorithm finishes acquiring on all instances, some time has passed. The actual safe window for holding the lock is reduced by this elapsed time plus a small clock drift factor.

Implementation: redlock-node

bash
npm install redlock

typescript
import Redlock from 'redlock';
import Redis from 'ioredis';

// 5 independent Redis instances
const clients = [
  new Redis({ host: 'redis-1.internal', port: 6379 }),
  new Redis({ host: 'redis-2.internal', port: 6379 }),
  new Redis({ host: 'redis-3.internal', port: 6379 }),
  new Redis({ host: 'redis-4.internal', port: 6379 }),
  new Redis({ host: 'redis-5.internal', port: 6379 }),
];

const redlock = new Redlock(clients, {
  driftFactor: 0.01,        // 1% clock drift factor
  retryCount: 3,            // retry attempts
  retryDelay: 200,          // ms between retries
  retryJitter: 100,         // random jitter on retry
  automaticExtensionThreshold: 500,  // extend lock when < 500ms remaining
});

// Acquire lock for 10 seconds
async function processPayment(paymentId: string) {
  let lock;
  try {
    lock = await redlock.acquire([`lock:payment:${paymentId}`], 10000);
    // Lock held — safe to execute
    await executePaymentLogic(paymentId);
  } catch (err) {
    if (err instanceof Redlock.ExecutionError) {
      throw new Error('Could not acquire distributed lock — payment already in progress');
    }
    throw err;
  } finally {
    if (lock) {
      await lock.release();
    }
  }
}

// Using the lock extension (watchdog pattern)
async function processLongJob(jobId: string) {
  await redlock.using(
    [`lock:job:${jobId}`],
    30000,  // initial TTL
    async (signal) => {
      // redlock.using() automatically extends the lock every ~15s
      // signal.aborted === true if the lock could not be extended
      await doWork(jobId, signal);
    }
  );
}

What Redlock Guarantees

Under the assumptions the algorithm makes, Redlock provides:

Safety (mutual exclusion): At most one client holds the lock at any given time, provided:

Clock drift across instances is bounded and small relative to the lock TTL
Network delays are bounded
No more than minority (< N/2) of instances fail simultaneously

Liveness (progress): The lock will eventually be released — either by the holder explicitly, or by TTL expiry on all instances.

Fault tolerance: The algorithm remains correct when up to (N-1)/2 instances fail. With 5 instances, 2 can fail and the algorithm still works.

Martin Kleppmann's Critique

In 2016, Martin Kleppmann published "How to do distributed locking" (http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html), arguing that Redlock's safety guarantees are insufficient for strong mutual exclusion. Antirez (Salvatore Sanfilippo, Redis author) responded. The debate is worth understanding.

The Core Argument

Kleppmann's claim: Redlock assumes bounded clock drift, but real systems violate this assumption:

Clock jumps — NTP adjustments, VM live migration, and operator-initiated clock corrections can jump system time forward or backward by arbitrary amounts.
GC pauses — A process paused by GC for 30+ seconds continues executing after the pause without knowing time passed. It believes it holds a valid lock; the lock has expired and been granted to another client.
Network delays — A message delayed in a network buffer for longer than the lock TTL can arrive "late," causing the receiver to act on stale lock state.

The failure scenario:

T=0: Client 1 acquires Redlock with TTL=10s on 3/5 instances
T=1: Client 1 pauses (GC stop-the-world for 15 seconds)
T=11: All instances expire Client 1's lock
T=12: Client 2 acquires Redlock on 3/5 instances
T=12: Client 2 begins writing to the database
T=16: Client 1 resumes from GC pause — believes it holds the lock
T=16: Client 1 ALSO begins writing to the database
      → Two clients concurrently writing to a shared resource

This scenario happens even with a correctly implemented Redlock on correctly functioning Redis instances.

Kleppmann's conclusion: If you need strong mutual exclusion (the scenario above must never happen), Redlock is insufficient. You need a system with fencing tokens — a monotonically increasing counter that the resource uses to reject stale operations.

Antirez's Response

Antirez agreed that process pauses longer than the lock TTL can break Redlock's safety. His counterargument: this is a problem with all distributed locks, not specific to Redlock. He argued:

For most practical use cases (job deduplication, cache stampede prevention), the consequences of occasional double execution are acceptable
If you need fencing tokens, you need a system like ZooKeeper that provides them — Redis was not designed for this use case
The bounded clock drift assumption is reasonable for systems where NTP is properly configured and not subject to arbitrary jumps

The Honest Synthesis

Both are correct. The question is: what do you need?

Need	Solution
Best-effort mutual exclusion (double execution unlikely, not catastrophic)	Single-instance Redis lock or Redlock
Strong mutual exclusion (double execution must never happen)	ZooKeeper or etcd (provide fencing tokens)
High throughput lock with tolerable false positives	Redlock with short TTLs
Financial/ledger operations	Database row locks + idempotency keys — not Redis

When to Use Redlock

Appropriate:

Preventing duplicate processing of jobs when single Redis failure is a concern
Cache stampede prevention across multiple Redis instances
Leader election where brief periods of no leader are acceptable
Coordination where the worst case (two holders) causes temporary inconsistency, not permanent data corruption

Not appropriate:

Any operation where two concurrent holders cause irrecoverable data corruption
Financial transactions, inventory deductions, ledger writes
Systems where you cannot implement idempotency to recover from double execution

Practical guidance for most applications: A single Redis instance with SET NX PX is sufficient for 95% of distributed locking use cases. The scenarios where Redlock adds meaningful safety (Redis master fails exactly while holding a lock, before replication completes) are rare in practice. For the 5% of cases where this matters, evaluate whether ZooKeeper/etcd or database row locks are more appropriate than Redlock.

Redlock Configuration for Production

typescript
const redlock = new Redlock(clients, {
  driftFactor: 0.01,                     // 1% of TTL as drift allowance
  retryCount: 5,                         // try 5 times
  retryDelay: 200,                       // wait 200ms between retries
  retryJitter: 100,                      // add up to 100ms random jitter
  automaticExtensionThreshold: 1000,     // extend when < 1s remaining
});

// Listen for lock extension failures
redlock.on('clientError', (err) => {
  console.error('A redis client errored during lock operation:', err);
});

// Set lock TTL generously — deduct elapsed time in the algorithm
// For a job that takes up to 5s: set TTL to 30s (6x margin)
const lock = await redlock.acquire(['lock:job:123'], 30000);

Summary

Redlock acquires a lock on a quorum (N/2 + 1) of independent Redis instances — survives minority failures
The algorithm: acquire on all instances concurrently with small per-instance timeout, check quorum + remaining validity, use if both conditions met
Guarantees under bounded clock drift and bounded network delay: safety (one holder) + liveness (lock releases)
Kleppmann's critique: GC pauses, clock jumps, and network delays can exceed the lock TTL, causing two clients to simultaneously believe they hold the lock — Redlock cannot prevent this
Fencing tokens (not provided by Redis) are the correct solution for true mutual exclusion
Use Redlock for best-effort coordination where double execution is rare and survivable
Use ZooKeeper/etcd for strong mutual exclusion; use database row locks for financial operations

Next: A-5 — Reentrant Locks, Hierarchies, and Deadlock Prevention — advanced locking patterns including reentrant locks via Hash-stored reentry counters and consistent lock ordering to prevent circular waits.

PreviousModule A-3: Redis Functions: Persistent Stored Procedures Next Module A-5: Reentrant Locks, Hierarchies, and Deadlock Prevention