How can we detect divergence in an Aeron cluster and take the diverging node out of the cluster?
For example, in a 3-node Aeron cluster, if one of the nodes diverges from the other two, how can we monitor and detect the divergence as soon as it happens and prevent the diverging node from being elected leader?
But snapshots are only generated periodically, so divergence may be detected too late, and a diverging node could be elected leader in the meantime.
A problem (I think) that should be taken into account is that the 3 cluster members may not all process a transaction/message at exactly the same time. One or more members may be behind the others in processing messages from the log, so the comparison may have to wait until all members have had an opportunity to process the message.
Is this possible in Aeron cluster?
I have tried using snapshotting and writing transaction results to an archive and tracking archive positions to detect divergence.
But I would like to know whether there are better ways of detecting divergence.
You could create an additional recorded IPC publication from within the clustered service that records all outbound messages (let's call this the egress-log). A separate application/process can then read the recorded egress-log of each cluster node and compare the messages to detect divergence. To help this "divergence detector" process manage the comparison, I would have the cluster stamp each message with a monotonically increasing sequence number, so you can ensure you are comparing the same message across nodes (as well as detect gaps/dupes).
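
To make the comparison step concrete, here is a minimal sketch of the detector's core logic in plain Java. It deliberately contains no Aeron calls: the class `DivergenceDetector`, the method `onMessage`, and the idea of reducing each message to a `long` checksum are all my own assumptions for illustration. Whatever code replays each node's egress-log would call `onMessage` once per message. It waits until every node has reported a given sequence number before comparing (so a lagging member is not flagged as diverging), and it treats any gap or duplicate in a node's sequence numbers as an error.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical comparator; fed by whatever process reads each node's recorded egress-log. */
final class DivergenceDetector {
    private final int nodeCount;
    // sequence -> checksum reported by each node; a slot stays null until that node reports
    private final Map<Long, Long[]> pending = new HashMap<>();
    // last sequence seen per node; assumes sequences start at 1, so 0 means "nothing yet"
    private final long[] lastSeq;

    DivergenceDetector(int nodeCount) {
        this.nodeCount = nodeCount;
        this.lastSeq = new long[nodeCount];
    }

    /**
     * Records one egress message from one node.
     * Returns -1 while nodes agree or while slower nodes have not yet reported this
     * sequence; otherwise returns the id of the first node outside the majority.
     */
    int onMessage(int nodeId, long sequence, long checksum) {
        long expected = lastSeq[nodeId] + 1;
        if (sequence != expected) {
            // a gap or duplicate in the node's own egress-log
            throw new IllegalStateException(
                "node " + nodeId + ": expected sequence " + expected + " but got " + sequence);
        }
        lastSeq[nodeId] = sequence;

        Long[] seen = pending.computeIfAbsent(sequence, k -> new Long[nodeCount]);
        seen[nodeId] = checksum;
        for (Long c : seen) {
            if (c == null) {
                return -1; // some node is still behind; compare later
            }
        }
        pending.remove(sequence);

        // every node has reported: find a node whose checksum is not shared by a majority
        for (int i = 0; i < nodeCount; i++) {
            int agree = 0;
            for (Long c : seen) {
                if (c.equals(seen[i])) {
                    agree++;
                }
            }
            if (agree * 2 <= nodeCount) {
                return i;
            }
        }
        return -1;
    }
}
```

In practice the checksum could be something like a CRC of the message payload, and a non-negative return value would be the trigger for alerting or for whatever mechanism you use to take that node out of service. Note that entries in `pending` accumulate for as long as a node stays behind, so a real detector would also want a timeout or size limit there.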