I have a good understanding of Raft and have implemented it in the MIT 6.824 distributed systems course. I also know basic Paxos, but I have not implemented it yet, so I cannot grasp all of its details. Multi-Paxos confuses me even more, specifically: WHY can it eliminate lots of Prepare RPCs? I know the answer is supposed to be that Multi-Paxos can have a fixed leader and use the noMoreAccepted response from other peers to decide whether the Prepare RPC can be skipped, but I cannot see at a detailed level why and how that works.
I would like some recommendations: articles, sample code, or anything else that can help me with Multi-Paxos. Paxos Made Live and Paxos Made Simple, those two papers gave me a basic idea of what Paxos is and how it works.

To answer your specific question:
WHY can it eliminate lots of Prepare RPCs?

In the paper Paxos Made Simple, on page 10, it says:
A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139.
That is saying that if a leader broadcasts Prepare(135, n), a prepare for instance 135 using ballot number n, it is valid to define that message as applying to all instances >= 135 that are not yet fixed. We can reason that it is safe for any node to "spam" out prepare messages for an infinite number of the unfixed positions in the log stream, because for each position every acceptor applies the rules of Paxos to that position independently. We can therefore compress that infinite set of prepare messages down to a single one that applies to all higher unfixed positions, which eliminates all but one prepare message for the term of a stable leader. It is a fantastic optimisation.
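To make that concrete, here is a minimal sketch in Scala (illustrative only, not the TRex code; all names are made up) of an acceptor that treats one Prepare as phase 1 for its slot and every slot above it. Because the acceptor keeps a single promised ballot covering all higher slots, a stable leader never needs to send another Prepare during its term and can stream bare Accept messages for new log positions:

```scala
// Illustrative acceptor (not TRex code): one Prepare covers every slot >= fromSlot.
final case class BallotNumber(counter: Int, nodeId: Int) {
  def >(other: BallotNumber): Boolean =
    counter > other.counter || (counter == other.counter && nodeId > other.nodeId)
}

final case class Prepare(fromSlot: Long, ballot: BallotNumber)
// accepted is empty in the "noMoreAccepted" case: nothing at or above fromSlot was ever accepted.
final case class Promise(ballot: BallotNumber, accepted: Map[Long, (BallotNumber, String)])
final case class Accept(slot: Long, ballot: BallotNumber, value: String)

final class Acceptor {
  private var promised: BallotNumber = BallotNumber(0, 0)
  private var acceptedValues: Map[Long, (BallotNumber, String)] = Map.empty

  // Phase 1 for slot fromSlot *and* all slots above it, answered with one Promise.
  def onPrepare(p: Prepare): Option[Promise] =
    if (p.ballot > promised) {
      promised = p.ballot
      Some(Promise(p.ballot, acceptedValues.filter { case (slot, _) => slot >= p.fromSlot }))
    } else None // a higher ballot was already promised

  // During a stable leader's term only Accepts arrive for new slots; the blanket
  // promise made earlier already covers them, so no further Prepare is required.
  def onAccept(a: Accept): Boolean =
    if (a.ballot > promised || a.ballot == promised) {
      promised = a.ballot
      acceptedValues += (a.slot -> (a.ballot, a.value))
      true
    } else false
}
```

Under this reading, a newly elected leader sends one Prepare for the first unfixed slot; the promises report any values already accepted at higher slots (the empty case being the noMoreAccepted signal you mentioned), the leader completes those slots, and from then on each new client command costs only Accept messages.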
You asked about example code. I wrote an implementation of Multi-Paxos using functional programming in Scala that aims to be true to the paper Paxos Made Simple, over at https://github.com/trex-paxos/trex. The core state is PaxosData, the message protocol is at the bottom of PaxosProtocol, and the algorithm is a set of message-matching functions in PaxosAlgorithm. The algorithm takes the current immutable state and an immutable message as input and outputs the next immutable state for the node. Common behaviours are written as partial functions that have full unit tests. These partial functions are composed into complete functions used by leaders, followers and candidate leaders. There is a write-up at this blog.
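To illustrate the shape of that design (this is the general pattern only, not the actual TRex sources; the types here are simplified stand-ins), behaviours can be written as partial functions from an immutable (state, message) pair to the next state and then chained with orElse into the complete handler for a given role:

```scala
// The general pattern only (simplified types, not the TRex sources): behaviours are
// partial functions over an immutable (state, message) pair, composed with orElse.
sealed trait PaxosMessage
case object Heartbeat extends PaxosMessage
final case class ClientCommand(value: String) extends PaxosMessage

final case class NodeState(role: String, log: Vector[String])

object Behaviours {
  type Step = PartialFunction[(NodeState, PaxosMessage), NodeState]

  // Behaviour shared by every role: a heartbeat leaves the state unchanged.
  val handleHeartbeat: Step = {
    case (state, Heartbeat) => state
  }

  // Behaviour only a leader exercises: append the client's command to the log.
  val leaderAppendsCommand: Step = {
    case (state, ClientCommand(v)) if state.role == "leader" =>
      state.copy(log = state.log :+ v)
  }

  // Fallback so the composed function is total: ignore anything unhandled.
  val ignoreUnhandled: Step = {
    case (state, _) => state
  }

  // The complete handler for a leader, built from the reusable pieces above.
  val leaderFunction: ((NodeState, PaxosMessage)) => NodeState =
    handleHeartbeat orElse leaderAppendsCommand orElse ignoreUnhandled
}
```

Calling Behaviours.leaderFunction((NodeState("leader", Vector.empty), ClientCommand("x"))) returns the next immutable state, and each partial function can be unit tested on its own before being composed.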
It adds additional messages to the basic set as optimisations to speed up log replication. Those involve some implementation details that Lamport does not get into in his paper. One example is that negative acknowledgements are used to pass information between nodes so that a single failed network link between a node and the leader does not cause a stable leader to be interrupted. TRex tries to keep those features to a minimum to provide a basic but complete solution.
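As a rough sketch of that idea (an assumed design for illustration, not the actual TRex message set), a negative acknowledgement can carry the promised ballot and how recently the acceptor heard from the current leader, so a node that has merely lost its link to the leader can back off rather than escalate its ballot number and disturb a healthy leader:

```scala
// An assumed design for illustration (not the actual TRex messages): a rejection
// carries enough information for an isolated node to realise the leader is healthy.
object NackSketch {
  final case class Ballot(counter: Int, nodeId: Int)
  final case class Prepare(ballot: Ballot, fromSlot: Long)
  // The NACK reports the ballot already promised and how long ago the acceptor
  // last heard from the leader holding that ballot.
  final case class PrepareNack(promised: Ballot, leaderHeardMillisAgo: Long)

  // At an acceptor: reject a lower-ballot Prepare but include a leader-liveness hint.
  // (Tie-breaking equal counters on nodeId is omitted to keep the sketch short.)
  def onPrepare(p: Prepare, promised: Ballot,
                lastLeaderHeartbeatAt: Long, now: Long): Either[PrepareNack, Ballot] =
    if (p.ballot.counter > promised.counter) Right(p.ballot)
    else Left(PrepareNack(promised, now - lastLeaderHeartbeatAt))

  // At the would-be challenger: if most rejections say the leader was heard from
  // recently, back off and remain a follower rather than retry with higher ballots.
  def shouldBackOff(nacks: Seq[PrepareNack], leaderTimeoutMillis: Long): Boolean =
    nacks.count(_.leaderHeardMillisAgo < leaderTimeoutMillis) > nacks.size / 2
}
```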
An answer that you might find helpful about Multi-Paxos is this one, which discusses why Multi-Paxos is called that: https://stackoverflow.com/a/26619261/329496
There is also this one about how the original Part-Time Parliament paper uses a leader and describes a stable leader running multi-Paxos: https://stackoverflow.com/a/46012211/329496
Finally, you might enjoy my defence of Paxos post The Trial Of Paxos Algorithm.