network-programming, distributed, consensus, raft

How does Raft handle a prolonged network partition?


Consider that we are running Raft on 3 machines, A, B, and C, and let A be the leader, with the current term being 2. A network partition splits C from A and B. A and B remain on term 2, exchanging no messages besides periodic heartbeats. Meanwhile, C enters the candidate state, increments its term to 3, votes for itself, times out, and repeats. After, say, 10 cycles, the network partition is resolved. Now the state is A[2], B[2], C[12]; C will reject AppendEntries RPCs from A because term 2 is less than its current term, 12. C cannot assemble a quorum and will continue to run the leader election protocol as a candidate, becoming increasingly divergent from the current term of A and B.
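
For concreteness, here is a minimal Go sketch of the term check C applies when it receives the post-partition AppendEntries; the type and field names are hypothetical, not taken from any particular implementation:

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for a Raft replica's state.
type node struct {
	id          string
	currentTerm int
}

// handleAppendEntries applies only the first receiver rule from the paper:
// reply false if the incoming term is less than currentTerm.
func (n *node) handleAppendEntries(leaderTerm int) (term int, success bool) {
	if leaderTerm < n.currentTerm {
		// C's case after the partition heals: leaderTerm = 2, currentTerm = 12.
		return n.currentTerm, false
	}
	// Otherwise recognize the leader and adopt its (greater or equal) term.
	n.currentTerm = leaderTerm
	return n.currentTerm, true
}

func main() {
	c := &node{id: "C", currentTerm: 12}
	term, ok := c.handleAppendEntries(2)
	fmt.Printf("C replies with term=%d success=%v\n", term, ok) // term=12 success=false
}
```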

The question is then, how does Raft (or Raft-derived implementations) handle this issue? Some thoughts I had included:

  1. Such a situation is an availability issue rather than a safety violation. Ignore it and let human operators handle it by killing or resetting C
  2. Use exponential backoff on C's election timeout to slow how quickly its term diverges across repeated elections
  3. Have C use lastApplied instead of currentTerm as the basis for rejecting or accepting the AppendEntries RPC. That is, we trust the log as the source of truth for terms, rather than the currentTerm value. This comparison is already used to ensure that C would not win an election, per the Election Restriction; however, the paper seems to indicate that this "up-to-date" property is grounds for refusing to vote for C, but not grounds for C to acquiesce and reset to a follower. (The comparison itself is sketched right after this list.)
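
For reference, the "up-to-date" comparison from the Election Restriction looks at the last log entry's term and index rather than currentTerm. A minimal Go sketch, with hypothetical names and example values chosen only to illustrate the comparison:

```go
package main

import "fmt"

// logPosition describes a replica's last log entry; names are illustrative.
type logPosition struct {
	lastLogTerm  int
	lastLogIndex int
}

// atLeastAsUpToDate implements the Election Restriction comparison (§5.4.1):
// the candidate wins the comparison if its last entry has a higher term,
// or the terms are equal and its log is at least as long.
func atLeastAsUpToDate(candidate, voter logPosition) bool {
	if candidate.lastLogTerm != voter.lastLogTerm {
		return candidate.lastLogTerm > voter.lastLogTerm
	}
	return candidate.lastLogIndex >= voter.lastLogIndex
}

func main() {
	// Hypothetical values: the voter holds entries from term 2 that the
	// candidate never received, so the vote is denied even though the
	// candidate's currentTerm may be far ahead.
	candidate := logPosition{lastLogTerm: 1, lastLogIndex: 7}
	voter := logPosition{lastLogTerm: 2, lastLogIndex: 9}
	fmt.Println("grant vote:", atLeastAsUpToDate(candidate, voter)) // false
}
```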

Note: terminology as per In Search of an Understandable Consensus Algorithm (Extended Version)


Solution

  • When C rejects an AppendEntries RPC from the leader A, it replies with its current term, which is now greater than 2. Raft replicas always recognize greater terms, so that reply in turn causes the leader to step down, and a new election follows. Eventually, the cluster converges on a new term that is >= C's term (and therefore > 2). A minimal sketch of the step-down rule appears at the end of this answer.

    This scenario is discussed fairly often in the Raft dev community as a somewhat inconvenient one that can cause unnecessary churn in Raft clusters. To guard against it, the Raft dissertation, and most real-world implementations, introduce the so-called “pre-vote protocol.” The pre-vote protocol essentially dictates that before becoming a candidate, a follower must first determine whether it could win an election by asking its peers. In the scenario you described above, C would ask A and B for a pre-vote, and because of the network partition it would not receive any votes. So C would never transition to the candidate role, never increment its term, and thus never present a term > 2 after the partition heals. Thus, you’ve eliminated the churn. (A sketch of the pre-vote gate also appears at the end of this answer.)

    You can read more about the pre-vote protocol in Diego Ongaro’s dissertation, Consensus: Bridging Theory and Practice.
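
To make the step-down rule concrete, here is a minimal Go sketch; the types and names are hypothetical rather than taken from any real implementation. Any RPC or response that carries a higher term causes the recipient to adopt that term and revert to follower:

```go
package main

import "fmt"

// role and raftNode are hypothetical, minimal stand-ins for a replica's state.
type role int

const (
	follower role = iota
	candidate
	leader
)

type raftNode struct {
	role        role
	currentTerm int
}

// observeTerm applies the rule that any RPC or response carrying a higher
// term makes the recipient adopt that term and revert to follower.
func (n *raftNode) observeTerm(peerTerm int) {
	if peerTerm > n.currentTerm {
		n.currentTerm = peerTerm
		n.role = follower
	}
}

func main() {
	a := &raftNode{role: leader, currentTerm: 2}
	// A receives C's AppendEntries rejection carrying term 12.
	a.observeTerm(12)
	fmt.Printf("A is follower: %v, term: %d\n", a.role == follower, a.currentTerm)
}
```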
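
And here is a sketch of the pre-vote gate under the same assumptions: a follower asks its peers whether it could win at currentTerm + 1, and only starts a real election (and only then increments its term) if a strict majority would grant the vote:

```go
package main

import (
	"errors"
	"fmt"
)

// peer abstracts a remote replica's answer to a pre-vote request; the
// interface and everything below are illustrative, not a real library API.
type peer interface {
	requestPreVote(proposedTerm int) (granted bool, err error)
}

// unreachablePeer models A and B as seen from C during the partition.
type unreachablePeer struct{}

func (unreachablePeer) requestPreVote(int) (bool, error) {
	return false, errors.New("network partition: peer unreachable")
}

// shouldStartElection runs the pre-vote phase: the follower asks whether it
// could win at currentTerm+1 but does NOT increment currentTerm yet.
func shouldStartElection(currentTerm int, peers []peer) bool {
	votes := 1 // the node would vote for itself
	for _, p := range peers {
		if granted, err := p.requestPreVote(currentTerm + 1); err == nil && granted {
			votes++
		}
	}
	clusterSize := len(peers) + 1
	return votes > clusterSize/2 // strict majority required
}

func main() {
	// C is cut off from A and B, never wins the pre-vote, never increments
	// its term, and therefore causes no churn once the partition heals.
	peers := []peer{unreachablePeer{}, unreachablePeer{}}
	fmt.Println("C starts a real election:", shouldStartElection(2, peers)) // false
}
```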