Node thinks that it is online when it's network cable is unplugged. Pacemaker/Corosync

I am trying to cluster 2 computers together with Pacemaker/Corosync. The only resource that they share is an ocf:heartbeat:IPaddr this is the main problem:

Since there are only two nodes failover will only occur if the no-quorum-policy=ignore.

When the network cable is pulled from node A, corosync on node A binds to 127.0.0.1 and pacemaker believes that node A is still online and the node B is the one offline.

Pacemaker attempts to start the IPaddr on Node A but it fails to start because there is no network connection. Node B on the other hand recognizes that node B is offline and if the IPaddr service was started on node A it will start it on itself (node B) successfully.

However, since the service failed to start on node A it enters a fatal state and has to be rebooted to rejoin the cluster. (you could restart some of the needed services instead.)

1 workaround is the set start-failure-is-fatal="false" which makes node A continue to try to start the IPaddr service until it is successful. the problem with this is that once it is successful you have a ip conflict between the two nodes until they re cluster and one of the gives up the resource.

I am playing around with the idea of having a node attribute that mirrors cat /sys/class/net/eth0/carrier which is 1 when the cable is connected and zero when it is disconnected and then having a location rule that says if "connected" == zero don't start service kind of thing, but we'll see.

Any thoughts or ideas would be greatly appreciated.

Solution

After speaking with Andrew Beekhof (Author of Pacemaker) and Digimer on the freenote.net/#linux-cluster irc network, I have learned that the actual cause behind this issue is do to the cluster being improperly fenced.

Fencing or having stonith enabled is absolutely essential to having a successful High Availability Cluster. The following page is a must read on the subject:

Cluster Tutorial: Concept - Fencing

Many thanks to Digimer for providing this invaluable resource. The section on clustering answers this question, however the entire article is beneficial.

Basically fencing and S.T.O.N.I.T.H. (Shoot the other node in the head) are mechanisms that a cluster uses to make sure that a down node is actually dead. It needs to do this to avoid shared memory corruption, split brain status (multiple nodes taking over shared resources), and most make sure that your cluster does not get stuck in recovery or crash.

If you don't have stonith/fencing configured and enabled in your cluster environment you really need it.

Other issues to look out for are Stonith Deathmatch, and Fencing Loops.

In short the issue of loss of network connectivity causing split brain was solved by creating our own Stonith Device and writing a stonith agent following the /usr/share/doc/cluster-glue/stonith/README.external tutorial, and then writing a startup script that checks to see if the node is able to support joining the cluster and then starts corosync or waits 5 minutes and checks again.