failoverdisruptor-patternlmax

Disruptor pattern - how are master and slave nodes kept in sync?


In the LMAX Disruptor pattern, the replicator is used to replicate the input events from a master node to slave node. So the setup would probably look like the following:

enter image description here

The replicator of master node writes the events to a DB (all though we can think better mechanisms than writing to a DB- it is not very important to the problem statement). The Receiver of the slave node reads from the DB and puts the events onto the slave node's ring buffer.

The output events of the slave node are ignored.

Now there is chance that the Business Logic Processor of Master node will be slower than the Business Logic Processor of the Slave Node. For Example BL of Master node could be at slot 102, where as Slave node could be at 106. (This can occur because the replicator reads the event from the ring buffer before the business logic processor).

In such a scenario if master node fails and slave node now becomes the master node, a few crucial events could be missed by the external systems. This could happen because Node 2 when it was acting as slave node has its output ignored.

Martin Fowler does state that the job of the replicator is to keep the nodes in sync: "Earlier on I mentioned that LMAX runs multiple copies of its system in a cluster to support rapid failover. The replicator keeps these nodes in sync"

But I am not sure how it can keep Business Logic Processor in sync? Any ideas?


Solution

  • Replication is directly from master to slave node and not via a database. The replication gates on acknowledgement from the slave.

    http://www.infoq.com/presentations/LMAX

    The link above goes into more detail and it is worth reading the comments discussion on the presentation.