In Datastax's docs they say that there are four stages in the Paxos protocol (meaning, in a lightweight transaction):
- Prepare/Promise
- Read/Results
- Propose/Accept
- Commit/Acknowledge
while on the left side are the proposer's stages and on the right side are the acceptor's stages.
Then they try to explain the process:
A proposer prepares by sending a message to a quorum of acceptors that includes a proposal number. Each acceptor promises to accept the proposal if the proposal number is the highest they have received. Once the proposer receives a quorum of acceptors who promise, the value for the proposal is read from each acceptor and sent back to the proposer. The proposer figures out which value to use and proposes the value to a quorum of the acceptors along with the proposal number. Each acceptor accepts the proposal with a certain number if and only if the acceptor is not already promised to a proposal with a high number. The value is committed and acknowledged as a Cassandra write operation if all the conditions are met.
I've failed to understand this explanation. Could anybody explain it in a clearer way please?
Paxos algorithm at heart is a consensus algorithm. Let's say you have multiple coordinator Cassandra nodes and all of these nodes are proposing multiple updates. The Paxos algorithm ensures that among all the proposed updates at any given time, a single value is chosen and executed in order.
There are multiple stages to the algorithm and the first one is
Prepare/promise
In Paxos, the requests are executed in a specific order so we would want to assign a sequence number to each request and the request will be executed in the order based on the sequence number.
Clients send commands to the leader, which is basically the coordinator node in Cassandra, who decides wherein the sequence each command should appear.
In this stage, the leader tries to determine the correct sequence number of the request. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm.
It creates a prepare request with the value and the sequence number as 135. The other Cassandra nodes(replica) will check if the number 135 is greater than the highest number they have received till now, if yes, the node will return a Promise that it will not accept any other request with a sequence number less than 135.
It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. If a replica node has already responded to a prepared request of higher number, in that case, it returns a promise but with the value of the promise that it responded for sequence 135 so that the leader node can also know about that and your original request becomes 136.
Once the majority of replica nodes have returned a promise to the Leader then the next phase is executed.
Propose/Accept
If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals(New Entry).
If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare the request having a number greater than n.
That is how you ensure that the commands are executed in order.
Cassandra specific changes:
Read/result
All the Cassandra-Paxos queries are Compare-and-swap queries. Server checks the existing value and based on that update with the new value. For example, A increment counter operation may need that. This stage reads the existing value of the column and returns the results.
Commit/Acknowledge
At this stage, the actual write to the storage happens. After every replica has accepted the proposal, they still need to write it to the storage. So the replicas are writing the accepted value to Cassandra storage and sending an acknowledgment to the leader.
Honestly, I think that this system is most efficient when you have very less number of leader nodes(maybe 2). In the case of Cassandra, since every node can be a leader node at any point in time, there can be a lot of inefficiencies in the system.
The topic is difficult to explain in one answer but I will recommend yo to read this.