Module A-11·18 min read

How cluster nodes propagate topology changes via PING/PONG gossip, cluster-node-timeout in failure detection, split-brain handling with cluster-require-full-coverage, and when a Cluster sacrifices availability for consistency.

A-11 — Gossip Protocol and Network Partition Handling

Who this module is for: You run a Redis Cluster and want to understand how nodes discover each other, propagate topology changes, and handle scenarios where part of the cluster becomes unreachable. The gossip protocol is the nervous system of Redis Cluster — understanding it explains why certain failure modes occur and how to configure the cluster to balance availability against consistency.

The Gossip Protocol

In Redis Cluster, every node knows about every other node. There is no central registry. Instead, nodes maintain this knowledge through a gossip protocol: periodic exchange of cluster state information.

Every cluster-node-timeout / 2 milliseconds, each node sends a PING message to a random selection of other nodes. Each PING carries:

The sender's view of the cluster topology (node IDs, IP addresses, slot assignments, node states)
Information about nodes the sender considers potentially or definitely failed

Recipients respond with PONG messages carrying the same type of information. Through this continuous gossip, all nodes converge on a consistent view of the cluster topology.

cluster-node-timeout 15000   → default: 15 seconds
# PING interval ≈ 7.5 seconds (timeout / 2)

Node States: PFAIL and FAIL

PFAIL (Probable Failure)

A node marks another node as PFAIL (Probable Failure) if it does not receive a PONG response within cluster-node-timeout:

Node A sends PING to Node B
Node A waits cluster-node-timeout (15s)
No PONG received
Node A marks Node B as PFAIL

PFAIL is a soft state — a single node's suspicion. It could be caused by:

Node B genuinely crashed
Network partition between Node A and Node B (but Node B is fine from other nodes' perspective)
Node B is overloaded and slow to respond

FAIL (Definite Failure)

PFAIL becomes FAIL (Definite Failure) when a majority of primary nodes agree that a node is unreachable. This agreement propagates via gossip:

Node A: Node B is PFAIL
Node A gossips this to Node C, Node D, Node E (includes PFAIL flag in PING)

Node C: "Node A says B is PFAIL" + "I also can't reach B" → Node B is PFAIL for me too
Node D: "Node A says B is PFAIL, Node C says B is PFAIL" → majority agrees → Node B is FAIL

Node D propagates FAIL state via gossip
All nodes receive FAIL state for Node B within seconds

The FAIL transition is irreversible until the node comes back online and communicates with the cluster.

Failover Election

When a primary node is declared FAIL, its replica(s) detect this via gossip and initiate a failover election:

The replica waits a delay proportional to its replication lag (less-lagged replicas go first)
The replica broadcasts FAILOVER_AUTH_REQUEST to all primary nodes
Primary nodes vote: they grant one vote per failover epoch; the first replica to ask in this epoch gets the vote
If a replica receives votes from a majority of primaries: it promotes itself
The new primary sends PONG packets with its updated slot ownership

Failover completes in approximately 1–2 × cluster-node-timeout.

Network Partitions: CAP Trade-offs

A network partition divides the cluster into isolated segments. Redis Cluster must choose between consistency (only one segment can accept writes) and availability (both segments continue accepting writes, potentially diverging).

Minority Partition

If a partition isolates a minority of nodes (fewer primary nodes than quorum), those nodes eventually stop accepting writes:

3-node cluster: nodes A, B, C
Partition: A isolated, B+C together

Node A: "I can't reach B or C" → B and C are PFAIL
        After timeout: Cannot reach majority → enters cluster_state:fail
        Node A: stops accepting writes

Node B + C: Node A is PFAIL → FAIL
            Node A's replica (if any) on B+C side promotes → cluster continues

Node A entering fail state is the correct behaviour — it protects against writing data that would be lost when the partition heals and Node A rejoins as a replica.

Majority Partition

If the primary nodes are split with majority on one side and minority on the other:

5-primary cluster: A B C D E
Partition: A+B vs C+D+E

A+B side: cannot reach C, D, E → they are PFAIL
          A+B cannot reach majority → eventually stop accepting writes

C+D+E side: cannot reach A, B → PFAIL → FAIL for A and B
            A and B's replicas on C+D+E side promote
            C+D+E side continues serving with full slot coverage

The cluster-require-full-coverage Setting

cluster-require-full-coverage yes   → default

With yes: if any slot has no reachable primary, the entire cluster stops accepting commands for all keys — not just the affected slots. This maximises consistency at the cost of availability.

With no: the cluster continues serving keys for slots that do have a reachable primary. Slots without a reachable primary return errors, but other slots continue working. This maximises availability at the cost of partial availability.

Trade-off decision:

yes: prefer consistency — no reads/writes during partial failure
no: prefer availability — partial reads/writes continue; clients must handle CLUSTERDOWN errors for unavailable slots

Gossip Topology Propagation

When a node joins or leaves the cluster, or when slot assignments change, this information propagates via gossip within a few round-trip times:

New node joins → sends MEET to one existing node
Existing node: "New node has joined" → includes in next PING messages
All nodes receive information about new node within seconds

Slots reassigned during resharding propagate similarly — each CLUSTER SETSLOT NODE call is included in gossip messages until all nodes have updated their slot maps.

Convergence time: In a 6-node cluster with cluster-node-timeout 15s, topology changes propagate to all nodes within 1–2 seconds. In larger clusters (100+ nodes), convergence takes proportionally longer due to the O(N) gossip messages required.

CLUSTER NODES: Reading the Topology

CLUSTER NODES

Each line represents one node:

<id> <ip:port@cport[,hostname]> <flags> <master> <ping-sent> <pong-recv> <config-epoch> <link-state> <slot>...

a3f8b2c7 10.0.1.50:6379@16379 master - 0 1717000000 1 connected 0-5460
b9e1d4f5 10.0.1.51:6379@16379 master - 0 1717000000 2 connected 5461-10922
c6a2e3d8 10.0.1.52:6379@16379 master - 0 1717000000 3 connected 10923-16383
d1f4a9b2 10.0.1.53:6379@16379 slave a3f8b2c7 0 1717000000 1 connected
e5c8d7f1 10.0.1.54:6379@16379 slave b9e1d4f5 0 1717000000 2 connected
f2b6e4a9 10.0.1.55:6379@16379 slave c6a2e3d8 0 1717000000 3 connected

Flags:

master — primary node
slave — replica node
fail? — PFAIL (suspected failure)
fail — FAIL (confirmed failure)
handshake — node recently added, still in handshake
noaddr — no address known

Slot ranges: Listed for primary nodes. 10923-16383 means the node owns slots 10923 to 16383.

Manual Failover

Trigger a manual failover (useful for planned maintenance):

bash
# On the replica you want to promote:
CLUSTER FAILOVER [FORCE | TAKEOVER]

Without options (graceful failover):

Replica asks the primary to pause accepting writes
Primary waits for the replica to catch up in replication
Primary transfers its epoch to the replica
Replica promotes — no data loss

FORCE: Primary is unreachable but you want to force the replica to promote without waiting for the primary's cooperation.

TAKEOVER: Promote without even requesting votes from other primaries. Use only when the cluster has lost quorum and you need to recover a partition manually. Dangerous — can cause split-brain.

Tuning Cluster Timing

cluster-node-timeout 15000     → detection time (default 15s)
cluster-migration-barrier 1    → minimum replicas a primary must have before a replica can migrate
cluster-allow-reads-when-down no → allow replicas to serve reads when cluster is in fail state

Lower cluster-node-timeout:

Faster failure detection and failover
Higher risk of false positives on brief network hiccups
More gossip traffic (PINGs sent more frequently)
Recommended minimum: 5000ms (5 seconds)

Higher cluster-node-timeout:

Slower failure detection and failover (more downtime during primary failure)
More resilient to brief network delays
Less gossip traffic

Summary

Gossip protocol: nodes send PING to random peers every timeout/2; PONG responses carry cluster topology
PFAIL: one node's suspicion (no PONG within timeout)
FAIL: majority agreement via gossip → triggers failover
Failover: replica detects FAIL → broadcasts election request → receives majority votes → promotes → propagates new topology via gossip
Network partitions: minority side eventually enters cluster_state:fail and stops accepting writes — correct behaviour prevents data divergence
cluster-require-full-coverage yes (default): entire cluster stops if any slot has no primary — prefer consistency
cluster-require-full-coverage no: cluster serves available slots — prefer partial availability
CLUSTER NODES shows full topology with node states, slot assignments, and link state
Graceful manual failover via CLUSTER FAILOVER on the target replica — zero data loss

Next: A-12 — Multi-Region Redis: Active-Active and Geo-Replication — distributing Redis across regions, CRDT-based conflict resolution, and the latency-vs-consistency trade-offs of global distribution.

PreviousModule A-10: Resharding, Node Addition, and Live Slot Migration Next Module A-12: Multi-Region Redis: Active-Active and Geo-Replication