apache-kafka, apache-zookeeper, kafka-cluster

Kafka: Quorum-based approach to elect the new leader?


There are two common strategies for keeping replicas in sync, primary-backup replication and quorum-based replication, as stated here:

In primary-backup replication, the leader waits until the write completes on every replica in the group before acknowledging the client. If one of the replicas is down, the leader drops it from the current group and continues to write to the remaining replicas. A failed replica is allowed to rejoin the group if it comes back and catches up with the leader. With f replicas, primary-backup replication can tolerate f-1 failures.

In the quorum-based approach, the leader waits until a write completes on a majority of the replicas. The size of the replica group doesn’t change even when some replicas are down. If there are 2f+1 replicas, quorum-based replication can tolerate f replica failures. If the leader fails, it needs at least f+1 replicas to elect a new leader.
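To make the arithmetic in those two paragraphs concrete, here is a small standalone Java sketch (a toy calculation, not Kafka or ZooKeeper code) that just prints the number of failures each strategy can tolerate for a few replica counts:

```java
// Toy calculation illustrating the fault-tolerance numbers quoted above.
public class ReplicationTolerance {

    // Primary-backup: with f replicas, up to f-1 failures can be tolerated,
    // because a single surviving replica still holds every acknowledged write.
    static int primaryBackupTolerance(int replicas) {
        return replicas - 1;
    }

    // Quorum-based: with 2f+1 replicas, up to f failures can be tolerated,
    // because any majority (f+1 nodes) still intersects every past write quorum.
    static int quorumTolerance(int replicas) {
        return (replicas - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.printf("replicas=%d  primary-backup tolerates %d failures, quorum tolerates %d%n",
                    n, primaryBackupTolerance(n), quorumTolerance(n));
        }
    }
}
```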

I have a question about the statement "If the leader fails, it needs at least f+1 replicas to elect a new leader" in the quorum-based approach. Why is a quorum (majority) of at least f+1 replicas required to elect a new leader? Why isn't any replica out of the f+1 in-sync replicas (ISR) simply selected? Why do we need an election instead of a simple selection?

For the election, how does ZooKeeper elect the final leader out of the remaining replicas? Does it compare which replica has the latest updates? Also, why do I need an odd number (say 3) of ZooKeeper nodes to elect a leader instead of an even number (say 2)?


Solution

  • Also, why do I need an odd number (say 3) of ZooKeeper nodes to elect a leader instead of an even number (say 2)?

    In a quorum-based system like ZooKeeper, a leader election requires a simple majority out of an "ensemble", i.e., the nodes which form the zK cluster. So a 3-node ensemble can tolerate one node failure, because the remaining two can still form a majority and stay operational. In a 4-node ensemble, you need at least 3 nodes alive to form a majority, so it too can tolerate only 1 node failure. A 5-node ensemble, on the other hand, can tolerate 2 node failures.

    Now you see that a 3-node or 4-node cluster can effectively tolerate only 1 node failure, so it makes sense to have an odd number of nodes to maximise the number of failures that can be tolerated for a given cluster size.
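    As a quick illustration of that majority arithmetic, here is a small standalone Java sketch (a toy calculation, not ZooKeeper code) that prints the quorum size and tolerated failures for ensembles of 2 to 6 nodes:

    ```java
    // Toy calculation showing why even-sized ensembles add no extra fault tolerance:
    // majority = floor(n/2) + 1, tolerated failures = n - majority.
    public class EnsembleMath {
        public static void main(String[] args) {
            for (int n = 2; n <= 6; n++) {
                int majority = n / 2 + 1;     // simple majority needed for a quorum
                int tolerated = n - majority; // nodes that may fail and still leave a quorum
                System.out.printf("ensemble=%d  majority=%d  tolerated failures=%d%n",
                        n, majority, tolerated);
            }
        }
    }
    ```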

    zK leader election relies on a Paxos-like protocol called ZAB. Every write goes through the leader; the leader generates a transaction id (zxid) and assigns it to each write request. The id represents the order in which the writes are applied on all replicas. A write is considered successful once the leader receives acks from a majority. An explanation of ZAB.
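    The following is only a toy model, not ZAB itself, but it sketches those two ideas in Java: the leader hands out monotonically increasing zxids, and a write counts as committed once a majority of the ensemble has acknowledged it:

    ```java
    import java.util.concurrent.atomic.AtomicLong;

    // Minimal toy model (not ZAB) of zxid ordering plus majority acknowledgement.
    public class ToyQuorumLog {
        private final int ensembleSize;
        private final AtomicLong nextZxid = new AtomicLong(0);

        ToyQuorumLog(int ensembleSize) {
            this.ensembleSize = ensembleSize;
        }

        // Leader assigns the next zxid; the zxid order is the order in which
        // all replicas must apply the writes.
        long propose() {
            return nextZxid.incrementAndGet();
        }

        // A write is successful once acks (including the leader's own) reach a majority.
        boolean isCommitted(int acks) {
            return acks >= ensembleSize / 2 + 1;
        }

        public static void main(String[] args) {
            ToyQuorumLog log = new ToyQuorumLog(5);
            long zxid = log.propose();
            System.out.println("zxid " + zxid + " committed with 3/5 acks? " + log.isCommitted(3)); // true
            System.out.println("zxid " + zxid + " committed with 2/5 acks? " + log.isCommitted(2)); // false
        }
    }
    ```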

    My question is: why is a quorum (majority) of at least f+1 replicas required to elect a new leader? Why isn't any replica out of the f+1 in-sync replicas (ISR) simply selected? Why do we need an election instead of a simple selection?

    As for why election instead of selection - in general, in a distributed system with eventual consistency, you need to have an election because there is no easy way to know which of the remaining nodes has the latest data and is thus qualified to become a leader.

    In the case of Kafka, for a setup with multiple replicas and ISRs, there could potentially be multiple nodes whose data is as up to date as that of the leader.

    Kafka uses ZooKeeper only as an enabler for leader election. If a Kafka partition leader is down, the Kafka cluster controller gets informed of this fact via zK, and the cluster controller chooses one of the ISR to be the new leader. So you can see that this "election" is different from a new-leader election in a quorum-based system like zK.
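    If you want to see which broker currently leads a partition and which replicas are in its ISR, a sketch along these lines using the Kafka AdminClient should work (the broker address localhost:9092 and the topic name my-topic are placeholders):

    ```java
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    // Print the current leader and ISR for every partition of a topic.
    public class ShowLeaderAndIsr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                        .all().get().get("my-topic");
                desc.partitions().forEach(p ->
                        System.out.printf("partition=%d leader=%s isr=%s%n",
                                p.partition(), p.leader(), p.isr()));
            }
        }
    }
    ```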

    Which broker among the ISR is "selected" is a bit more complicated (see) -

    Each replica stores messages in a local log and maintains a few important offset positions in the log. The log end offset (LEO) represents the tail of the log. The high watermark (HW) is the offset of the last committed message. Each log is periodically synced to disks. Data before the flushed offset is guaranteed to be persisted on disks.

    So when a leader fails, a new leader is elected roughly as follows: each surviving replica in the ISR registers itself in ZooKeeper; the replica that registers first becomes the new leader and chooses its LEO as the new HW; the remaining replicas (now followers) truncate their logs to their HW and then catch up from the new leader.
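    Here is a toy Java model (not Kafka's actual implementation) of the LEO/HW bookkeeping and the failover behaviour sketched above: a follower truncates its log back to its HW, and a newly elected leader adopts its LEO as the new HW:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Toy replica log: LEO is the tail of the log, HW marks how much is committed.
    public class ToyReplicaLog {
        private final List<String> log = new ArrayList<>();
        private long highWatermark = 0;            // everything below this offset is committed

        long logEndOffset() { return log.size(); } // LEO: tail of the local log

        void append(String record) { log.add(record); }

        void markCommitted(long offset) { highWatermark = offset; }

        // Follower on leader change: drop the uncommitted suffix, then catch up from the new leader.
        void truncateToHighWatermark() {
            while (log.size() > highWatermark) {
                log.remove(log.size() - 1);
            }
        }

        // New leader on election: its LEO becomes the new HW.
        void becomeLeader() { highWatermark = logEndOffset(); }

        public static void main(String[] args) {
            ToyReplicaLog replica = new ToyReplicaLog();
            replica.append("a"); replica.append("b"); replica.append("c");
            replica.markCommitted(2);            // "c" was never committed
            replica.truncateToHighWatermark();   // log is now [a, b]
            replica.becomeLeader();              // HW moves to LEO = 2
            System.out.println("LEO=" + replica.logEndOffset() + " HW=" + replica.highWatermark);
        }
    }
    ```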

    Now you can probably appreciate the benefit of the primary-backup model compared to the quorum model - using the above strategy, a 3-node Kafka cluster with 2 ISRs can tolerate 2 node failures -- including a leader failure -- at the same time and still get a new leader elected (though that new leader would have to reject new writes for a while, until one of the failed nodes comes back and catches up with it).

    The price to pay is, of course, higher write latency - in a 3-node Kafka cluster with 2 ISRs, the leader has to wait for an acknowledgement from both followers in order to acknowledge the write to the client (producer), whereas in a quorum model the write could be acknowledged as soon as one of the followers acknowledges it.
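    On the producer side, this trade-off is what acks=all selects: the leader only acknowledges the write once the full ISR has it. A minimal producer sketch, assuming a broker at localhost:9092 and a topic my-topic (both placeholders):

    ```java
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Producer that waits for the whole ISR before a write is acknowledged.
    public class DurableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // leader waits for all in-sync replicas
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));
                producer.flush();
            }
        }
    }
    ```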

    So, depending upon the use case, Kafka offers the possibility to trade latency for durability. 2 ISRs means sometimes higher write latency, but higher durability. If you run with only one ISR, then in case you lose the leader and that ISR node, you either have no availability or you can choose an unclean leader election, in which case you have lower durability.
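    The durability side of that choice is usually pinned down with topic-level configuration. Below is a sketch that creates a topic with min.insync.replicas=2 and unclean leader election disabled via the AdminClient; the broker address, topic name, and a 3-broker cluster are assumptions for illustration:

    ```java
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    // Topic where acks=all writes need the leader plus at least one follower,
    // and an out-of-sync replica can never be elected leader.
    public class CreateDurableTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            Map<String, String> configs = new HashMap<>();
            configs.put("min.insync.replicas", "2");
            configs.put("unclean.leader.election.enable", "false");

            NewTopic topic = new NewTopic("my-topic", 3, (short) 3).configs(configs);

            try (AdminClient admin = AdminClient.create(props)) {
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }
    ```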

    Update - Leader election and preferred replicas:

    All nodes which make up the cluster are already registered in zK. When one of the nodes fails, zK notifies the controller node (which is itself elected via zK). When that happens, one of the live ISRs is selected as the new leader. But Kafka has the concept of a "preferred replica" to balance leadership distribution across cluster nodes. With auto.leader.rebalance.enable=true, the controller will try to hand leadership back to that preferred replica. The preferred replica is the first broker in the partition's list of assigned replicas. This is all a bit complicated, but only Kafka admins need to know about it.
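    If the automatic rebalance is disabled, leadership can also be handed back to the preferred replica on demand. If I remember the Admin API correctly (available from roughly Kafka 2.4), a sketch would look like this, with the broker address, topic, and partition as placeholders:

    ```java
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.ElectionType;
    import org.apache.kafka.common.TopicPartition;

    // Ask the controller to move leadership of my-topic/0 back to its preferred replica
    // (the Admin API counterpart of the kafka-leader-election.sh tool).
    public class PreferredLeaderElection {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                admin.electLeaders(ElectionType.PREFERRED,
                        Collections.singleton(new TopicPartition("my-topic", 0)))
                     .all().get();
            }
        }
    }
    ```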