messaging, state-machine, aeron

How to detect divergence at a transaction/message level in Aeron Cluster?


How can we detect divergence in an Aeron cluster and take the diverging node out of the cluster?

For example, in a 3-node Aeron cluster, if one node diverges from the other two, how can we monitor for and detect the divergence as soon as it happens, and prevent the diverging node from being elected leader?

  1. We could periodically generate a snapshot of the state on all 3 nodes in the cluster, hash the snapshots, and verify that the hashes are the same.

But snapshots are only generated periodically, so divergence may be detected too late, and a diverging node could be elected leader in the meantime.

  2. Every time a message/transaction is processed, the resulting state/outcome on each cluster node could be compared. If divergence is noticed, a message should be logged or an alert generated identifying the message/transaction whose processing caused the divergence, and the diverging node should be taken out of the cluster.

A problem (I think) that should be taken into account is that the 3 cluster members may not all process a transaction/message at exactly the same time. One or more members may be behind the others in processing messages from the log, so the comparison may have to wait until all members have had an opportunity to process the message.
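The waiting requirement described above can be sketched in plain Java. This is only an illustration and not part of Aeron: `DivergenceChecker`, `onResult`, `messageId` and `stateHash` are hypothetical names, and it assumes each member reports some hash of its resulting state together with an identifier for the message that produced it. Results are held back until every member has reported, and only then compared against the majority.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not an Aeron API): compares per-message results once all members have reported. */
final class DivergenceChecker {
    private final int memberCount;
    // messageId -> (memberId -> stateHash); an entry waits here until every member has reported.
    private final Map<Long, Map<Integer, Long>> pending = new HashMap<>();

    DivergenceChecker(int memberCount) {
        this.memberCount = memberCount;
    }

    /**
     * Records the hash a member produced for a message.
     * Returns the ids of members that disagree with the majority, or an empty list
     * if all members agree or slower members have not reported yet.
     */
    List<Integer> onResult(int memberId, long messageId, long stateHash) {
        Map<Integer, Long> results = pending.computeIfAbsent(messageId, k -> new TreeMap<>());
        results.put(memberId, stateHash);
        if (results.size() < memberCount) {
            return Collections.emptyList(); // still waiting for members that are behind in the log
        }
        pending.remove(messageId);

        // Count how many members produced each hash and take the most common one as the reference.
        // If every member disagrees there is no real majority and the reference is arbitrary.
        Map<Long, Integer> votes = new HashMap<>();
        for (long hash : results.values()) {
            votes.merge(hash, 1, Integer::sum);
        }
        long majorityHash = Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();

        List<Integer> diverged = new ArrayList<>();
        for (Map.Entry<Integer, Long> entry : results.entrySet()) {
            if (entry.getValue() != majorityHash) {
                diverged.add(entry.getKey());
            }
        }
        return diverged;
    }

    /** Number of messages still waiting for at least one member. */
    int pendingCount() {
        return pending.size();
    }
}
```

A growing `pendingCount()` would indicate that some member is lagging or has stopped reporting, which a real checker would probably want to alert on with a timeout.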

Is this possible in Aeron Cluster?

I have tried using snapshotting, writing transaction results to an archive, and tracking archive positions to detect divergence.

But I would like to know whether there are better ways of detecting divergence.


Solution

  • You could create an additional recorded IPC publication from within the clustered service that records all outbound messages (let's call this the egress log). A separate application/process can then read the recorded egress log of each cluster node and compare the messages to detect divergence. To help this "divergence detector" process manage the comparison, I would have the cluster stamp each message with a monotonically increasing sequence number, so you can ensure you are comparing the same message on each node (as well as detect gaps/dupes).
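A minimal sketch of the sequence stamping and the gap/dupe check, in plain Java. The Aeron parts (offering to the recorded IPC publication and replaying each node's recording) are deliberately left out, and `EgressLogEntry`, `EgressLogTracker`, and the 8-byte sequence prefix layout are assumptions made for illustration, not anything defined by Aeron. The sequence counter would need to be advanced deterministically inside the clustered service so that every node stamps the same message with the same number.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical entry layout: 8-byte sequence number followed by the outbound message bytes. */
final class EgressLogEntry {
    /** Prefixes the payload with its sequence number before it is written to the egress log. */
    static byte[] stamp(long sequence, byte[] payload) {
        return ByteBuffer.allocate(Long.BYTES + payload.length)
            .putLong(sequence)
            .put(payload)
            .array();
    }

    /** Reads the sequence number back from a stamped entry. */
    static long sequenceOf(byte[] entry) {
        return ByteBuffer.wrap(entry).getLong();
    }

    /** Two nodes agree on a message when the stamped entries are byte-for-byte identical. */
    static boolean matches(byte[] entryFromNodeA, byte[] entryFromNodeB) {
        return Arrays.equals(entryFromNodeA, entryFromNodeB);
    }
}

/** Hypothetical per-node tracker used by the detector to spot gaps and duplicates. */
final class EgressLogTracker {
    enum Status { OK, DUPLICATE, GAP }

    private long nextExpected;

    EgressLogTracker(long firstSequence) {
        this.nextExpected = firstSequence;
    }

    Status accept(byte[] entry) {
        long sequence = EgressLogEntry.sequenceOf(entry);
        if (sequence < nextExpected) {
            return Status.DUPLICATE; // already seen this sequence number
        }
        if (sequence > nextExpected) {
            nextExpected = sequence + 1; // resynchronise after the missing range
            return Status.GAP;
        }
        nextExpected++;
        return Status.OK;
    }
}
```

The detector would keep one tracker per cluster node and call `matches` only on entries that carry the same sequence number.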