How cluster nodes propagate topology changes via PING/PONG gossip, cluster-node-timeout in failure detection, split-brain handling with cluster-require-full-coverage, and when a Cluster sacrifices availability for consistency.
A-11 — Gossip Protocol and Network Partition Handling
Who this module is for: You run a Redis Cluster and want to understand how nodes discover each other, propagate topology changes, and handle scenarios where part of the cluster becomes unreachable. The gossip protocol is the nervous system of Redis Cluster — understanding it explains why certain failure modes occur and how to configure the cluster to balance availability against consistency.
The Gossip Protocol
In Redis Cluster, every node knows about every other node. There is no central registry. Instead, nodes maintain this knowledge through a gossip protocol: periodic exchange of cluster state information.
Every cluster-node-timeout / 2 milliseconds, each node sends a PING message to a random selection of other nodes. Each PING carries:
- The sender's view of the cluster topology (node IDs, IP addresses, slot assignments, node states)
- Information about nodes the sender considers potentially or definitely failed
Recipients respond with PONG messages carrying the same type of information. Through this continuous gossip, all nodes converge on a consistent view of the cluster topology.
cluster-node-timeout 15000 → default: 15 seconds
# PING interval ≈ 7.5 seconds (timeout / 2)
Node States: PFAIL and FAIL
PFAIL (Probable Failure)
A node marks another node as PFAIL (Probable Failure) if it does not receive a PONG response within cluster-node-timeout:
Node A sends PING to Node B
Node A waits cluster-node-timeout (15s)
No PONG received
Node A marks Node B as PFAIL
PFAIL is a soft state — a single node's suspicion. It could be caused by:
- Node B genuinely crashed
- Network partition between Node A and Node B (but Node B is fine from other nodes' perspective)
- Node B is overloaded and slow to respond
FAIL (Definite Failure)
PFAIL becomes FAIL (Definite Failure) when a majority of primary nodes agree that a node is unreachable. This agreement propagates via gossip:
Node A: Node B is PFAIL
Node A gossips this to Node C, Node D, Node E (includes PFAIL flag in PING)
Node C: "Node A says B is PFAIL" + "I also can't reach B" → Node B is PFAIL for me too
Node D: "Node A says B is PFAIL, Node C says B is PFAIL" → majority agrees → Node B is FAIL
Node D propagates FAIL state via gossip
All nodes receive FAIL state for Node B within seconds
The FAIL transition is irreversible until the node comes back online and communicates with the cluster.
Failover Election
When a primary node is declared FAIL, its replica(s) detect this via gossip and initiate a failover election:
- The replica waits a delay proportional to its replication lag (less-lagged replicas go first)
- The replica broadcasts
FAILOVER_AUTH_REQUESTto all primary nodes - Primary nodes vote: they grant one vote per failover epoch; the first replica to ask in this epoch gets the vote
- If a replica receives votes from a majority of primaries: it promotes itself
- The new primary sends
PONGpackets with its updated slot ownership
Failover completes in approximately 1–2 × cluster-node-timeout.
Network Partitions: CAP Trade-offs
A network partition divides the cluster into isolated segments. Redis Cluster must choose between consistency (only one segment can accept writes) and availability (both segments continue accepting writes, potentially diverging).
Minority Partition
If a partition isolates a minority of nodes (fewer primary nodes than quorum), those nodes eventually stop accepting writes:
3-node cluster: nodes A, B, C
Partition: A isolated, B+C together
Node A: "I can't reach B or C" → B and C are PFAIL
After timeout: Cannot reach majority → enters cluster_state:fail
Node A: stops accepting writes
Node B + C: Node A is PFAIL → FAIL
Node A's replica (if any) on B+C side promotes → cluster continues
Node A entering fail state is the correct behaviour — it protects against writing data that would be lost when the partition heals and Node A rejoins as a replica.
Majority Partition
If the primary nodes are split with majority on one side and minority on the other:
5-primary cluster: A B C D E
Partition: A+B vs C+D+E
A+B side: cannot reach C, D, E → they are PFAIL
A+B cannot reach majority → eventually stop accepting writes
C+D+E side: cannot reach A, B → PFAIL → FAIL for A and B
A and B's replicas on C+D+E side promote
C+D+E side continues serving with full slot coverage
The cluster-require-full-coverage Setting
cluster-require-full-coverage yes → default
With yes: if any slot has no reachable primary, the entire cluster stops accepting commands for all keys — not just the affected slots. This maximises consistency at the cost of availability.
With no: the cluster continues serving keys for slots that do have a reachable primary. Slots without a reachable primary return errors, but other slots continue working. This maximises availability at the cost of partial availability.
Trade-off decision:
yes: prefer consistency — no reads/writes during partial failureno: prefer availability — partial reads/writes continue; clients must handleCLUSTERDOWNerrors for unavailable slots
Gossip Topology Propagation
When a node joins or leaves the cluster, or when slot assignments change, this information propagates via gossip within a few round-trip times:
New node joins → sends MEET to one existing node
Existing node: "New node has joined" → includes in next PING messages
All nodes receive information about new node within seconds
Slots reassigned during resharding propagate similarly — each CLUSTER SETSLOT NODE call is included in gossip messages until all nodes have updated their slot maps.
Convergence time: In a 6-node cluster with cluster-node-timeout 15s, topology changes propagate to all nodes within 1–2 seconds. In larger clusters (100+ nodes), convergence takes proportionally longer due to the O(N) gossip messages required.
CLUSTER NODES: Reading the Topology
CLUSTER NODES
Each line represents one node:
<id> <ip:port@cport[,hostname]> <flags> <master> <ping-sent> <pong-recv> <config-epoch> <link-state> <slot>...
a3f8b2c7 10.0.1.50:6379@16379 master - 0 1717000000 1 connected 0-5460
b9e1d4f5 10.0.1.51:6379@16379 master - 0 1717000000 2 connected 5461-10922
c6a2e3d8 10.0.1.52:6379@16379 master - 0 1717000000 3 connected 10923-16383
d1f4a9b2 10.0.1.53:6379@16379 slave a3f8b2c7 0 1717000000 1 connected
e5c8d7f1 10.0.1.54:6379@16379 slave b9e1d4f5 0 1717000000 2 connected
f2b6e4a9 10.0.1.55:6379@16379 slave c6a2e3d8 0 1717000000 3 connected
Flags:
master— primary nodeslave— replica nodefail?— PFAIL (suspected failure)fail— FAIL (confirmed failure)handshake— node recently added, still in handshakenoaddr— no address known
Slot ranges: Listed for primary nodes. 10923-16383 means the node owns slots 10923 to 16383.
Manual Failover
Trigger a manual failover (useful for planned maintenance):
bash# On the replica you want to promote: CLUSTER FAILOVER [FORCE | TAKEOVER]
Without options (graceful failover):
- Replica asks the primary to pause accepting writes
- Primary waits for the replica to catch up in replication
- Primary transfers its epoch to the replica
- Replica promotes — no data loss
FORCE: Primary is unreachable but you want to force the replica to promote without waiting for the primary's cooperation.
TAKEOVER: Promote without even requesting votes from other primaries. Use only when the cluster has lost quorum and you need to recover a partition manually. Dangerous — can cause split-brain.
Tuning Cluster Timing
cluster-node-timeout 15000 → detection time (default 15s)
cluster-migration-barrier 1 → minimum replicas a primary must have before a replica can migrate
cluster-allow-reads-when-down no → allow replicas to serve reads when cluster is in fail state
Lower cluster-node-timeout:
- Faster failure detection and failover
- Higher risk of false positives on brief network hiccups
- More gossip traffic (PINGs sent more frequently)
- Recommended minimum: 5000ms (5 seconds)
Higher cluster-node-timeout:
- Slower failure detection and failover (more downtime during primary failure)
- More resilient to brief network delays
- Less gossip traffic
Summary
- Gossip protocol: nodes send PING to random peers every
timeout/2; PONG responses carry cluster topology - PFAIL: one node's suspicion (no PONG within timeout)
- FAIL: majority agreement via gossip → triggers failover
- Failover: replica detects FAIL → broadcasts election request → receives majority votes → promotes → propagates new topology via gossip
- Network partitions: minority side eventually enters
cluster_state:failand stops accepting writes — correct behaviour prevents data divergence cluster-require-full-coverage yes(default): entire cluster stops if any slot has no primary — prefer consistencycluster-require-full-coverage no: cluster serves available slots — prefer partial availabilityCLUSTER NODESshows full topology with node states, slot assignments, and link state- Graceful manual failover via
CLUSTER FAILOVERon the target replica — zero data loss
Next: A-12 — Multi-Region Redis: Active-Active and Geo-Replication — distributing Redis across regions, CRDT-based conflict resolution, and the latency-vs-consistency trade-offs of global distribution.