I have recently started working with clusters; let me know if you need any more information.
I have an active-active HA cluster that is designed to keep working during a failover scenario.
Node1 and Node2 form the active-active cluster, with Pacemaker and Corosync as the cluster manager. Each node has one resource group containing three resources.
When Node1 goes down, Node2 takes over its resources as expected. When Node1 comes back online, pcs first stops Node1's resources on Node2 and then starts them on Node1, which is also expected and works fine.
Issue: I am facing a problem when both nodes are booted at the same time.
Scenario: Both nodes are powered off and then powered on at the same time. Say Node2 finishes booting first: pcs sees that Node1 is still offline (still booting) and starts Node1's resources on Node2. It then also starts Node2's own resources on Node2.
Meanwhile, once Node1 has finished booting, it starts its own resources. The problem is that Node1's resources, which are currently running (failed over) on Node2, are not stopped there first.
So in the end Node1 is running its own resources, while Node2 is running both Node1's and Node2's resources.
This never happens when the nodes are booted some time apart (for example, 15 minutes). It also works fine when only one node is rebooted or powered off.
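One way to confirm the duplicated state described above is to ask Pacemaker where each resource is active and flag any resource that reports more than one node. The following is only a minimal sketch: the resource names are hypothetical placeholders, and it assumes that `crm_resource --resource NAME --locate` prints one line containing "is running on:" for each node where the resource is active.

```shell
#!/bin/sh
# Sketch: flag resources that are active on more than one node.
# Assumption: `crm_resource --resource NAME --locate` prints one line
# containing "is running on:" for each node where NAME is active.

# Count the "is running on:" lines read from stdin.
# grep -c exits non-zero when the count is 0, so mask that with `|| true`.
count_active_nodes() {
    grep -c 'is running on:' || true
}

# Hypothetical resource names; replace with the ones from your groups.
RESOURCES="node1_rsc1 node1_rsc2 node1_rsc3"

# Only query the cluster if the Pacemaker tool is actually installed.
if command -v crm_resource >/dev/null 2>&1; then
    for rsc in $RESOURCES; do
        n=$(crm_resource --resource "$rsc" --locate 2>/dev/null | count_active_nodes)
        if [ "$n" -gt 1 ]; then
            echo "WARNING: $rsc is active on $n nodes"
        fi
    done
fi
```

Running this on either node shortly after a simultaneous boot should show whether any resource is active on both nodes at once.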
# pcs property list --all
Cluster Properties:
batch-limit: 0
cluster-delay: 60s
cluster-infrastructure: cman
cluster-recheck-interval: 15min
crmd-finalization-timeout: 30min
crmd-integration-timeout: 3min
crmd-transition-delay: 0s
dc-deadtime: 20s
dc-version: 1.1.11-97629de
default-action-timeout: 20s
default-resource-stickiness: 0
election-timeout: 2min
enable-startup-probes: true
expected-quorum-votes: 2
is-managed-default: true
last-lrm-refresh: 1565098302
load-threshold: 80%
maintenance-mode: false
migration-limit: -1
no-quorum-policy: ignore
node-action-limit: 0
node-health-green: 0
node-health-red: -INFINITY
node-health-strategy: none
node-health-yellow: 0
pe-error-series-max: -1
pe-input-series-max: 4000
pe-warn-series-max: 5000
placement-strategy: default
remove-after-stop: false
shutdown-escalation: 20min
start-failure-is-fatal: true
startup-fencing: true
stonith-action: reboot
stonith-enabled: false
stonith-timeout: 60s
stop-all-resources: false
stop-orphan-actions: true
stop-orphan-resources: true
symmetric-cluster: false
I was able to fix this issue by switching to pcs version 0.9.155. The older pcs version had this bug when both nodes were rebooted simultaneously.
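To check whether a node already runs a pcs version at or above the one that resolved the problem here, something like the sketch below could be used. It assumes that `pcs --version` prints only the bare version string and that GNU `sort -V` is available; the `version_ge` helper is a name made up for this example.

```shell
#!/bin/sh
# Sketch: check whether the installed pcs is at least the version that
# resolved the problem in this case (0.9.155).

# Succeeds if version $1 is greater than or equal to version $2.
# Relies on GNU `sort -V` (version sort): the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

FIXED_VERSION="0.9.155"

# Only run the comparison if pcs is installed on this machine.
if command -v pcs >/dev/null 2>&1; then
    # Assumption: `pcs --version` prints just the version, e.g. 0.9.155.
    installed=$(pcs --version)
    if version_ge "$installed" "$FIXED_VERSION"; then
        echo "pcs $installed is at or above $FIXED_VERSION"
    else
        echo "pcs $installed is older than $FIXED_VERSION"
    fi
fi
```

The check should be run on both nodes, since each node has its own pcs installation.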