mysql mariadb database-cluster galera

What is the defined behavior of a 3-node Galera cluster after one node dies?


I've been reading the documentation for Galera cluster: http://galeracluster.com/documentation-webpages/genindex.html

I keep seeing the recommendation (or, in some places, an explicit restriction) that a minimum cluster is 3 nodes.

My question is: what happens to a minimum cluster after one node fails?


Solution

  • It depends on how the node leaves the cluster. The situations below assume a three-node cluster in which one node leaves and all of the nodes are connected by an Ethernet switch.

    If one node shuts itself down gracefully, because the service is being restarted or because of a replication issue, the cluster becomes a two-node cluster and keeps functioning normally. If no queries were being served by the node that left, there is no interruption in operation.

    If the node goes missing due to a network problem, or otherwise leaves without telling the rest of the cluster, problems can arise. To function, the cluster needs a quorum: a majority of its known nodes must be active in the same partition. The two other nodes will continue to function normally, since their partition has more than half of the known nodes, but the node that left will stop accepting queries once it realizes that it is no longer in contact with the active partition. In this case, assuming the application can reach the two active nodes, the failure can go mostly unnoticed.

    The main reason that three servers are the recommended minimum is to increase the chance that a quorum will exist in the event of a network problem. If a cluster has two nodes (or, more generally, any even number of nodes), a single network link failure could cause the cluster to pause, since it could create two partitions that each hold exactly half of the nodes, neither having a quorum. With an odd number of nodes, a single network link failure cannot cause the cluster to pause, since a split into two partitions always leaves one side with a majority. If there is more than one network link failure, things can get more complicated, but only a partition with a quorum will function normally.

    If a node tries to connect to the active partition in a cluster, it will join normally. If it is only able to connect to an inactive partition, it will wait for a configurable amount of time while trying to contact the active partition.

    More information is available at http://galeracluster.com/documentation-webpages/recovery.html.
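
The quorum rule described in this answer can be illustrated with a small sketch. It is not Galera's actual implementation: it assumes every node counts equally (it ignores any per-node weighting), and the function names `has_quorum` and `active_partitions` are made up for the example. The `expected_size` argument stands for the number of nodes the cluster still expects to see, which shrinks when a node leaves gracefully but not when it simply disappears.

```python
def has_quorum(partition_size: int, expected_size: int) -> bool:
    """Return True if a partition holds a strict majority of the expected nodes."""
    return 2 * partition_size > expected_size


def active_partitions(partition_sizes, expected_size):
    """Return the sizes of the partitions that would keep serving queries."""
    return [size for size in partition_sizes if has_quorum(size, expected_size)]


# Three nodes, one vanishes without notice: the cluster still expects 3 nodes.
# The pair keeps quorum (2 of 3); the isolated node does not (1 of 3).
print(active_partitions([2, 1], expected_size=3))  # [2]

# Three nodes, one leaves gracefully: the expected size drops to 2,
# and the remaining pair is the whole cluster.
print(active_partitions([2], expected_size=2))     # [2]

# Two nodes split by a link failure: each side has exactly half,
# so neither has a majority and the cluster pauses.
print(active_partitions([1, 1], expected_size=2))  # []
```

The strict inequality is the important detail: exactly half of the nodes is not a majority, which is why an even split leaves no active partition.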