After startup, Node 1 does not show Node 2 in crm_mon, and similarly Node 2 does not show Node 1.
After analyzing the corosync log, I found that because of multiple retransmit failures both nodes marked each other as dead. I tried stopping and starting corosync and pacemaker, but they still do not form a cluster and do not show each other in crm_mon.
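For reference, these are the kinds of checks I was running on each node (a minimal sketch; corosync-objctl assumes corosync 1.x with the pacemaker plugin, which is what these logs show):

    crm_mon -1                      # one-shot cluster status; each node only listed itself
    corosync-cfgtool -s             # status of the local corosync ring(s)
    corosync-objctl | grep member   # membership as corosync itself sees it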
Logs of Node 2 (srv-vme-ccs-02):
Oct 30 02:22:49 srv-vme-ccs-02 crmd[1973]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now member (was (null))
So srv-vme-ccs-01 was still a member at that point.
Oct 30 10:07:34 srv-vme-ccs-02 corosync[1613]: [TOTEM ] Retransmit List: 117
Oct 30 10:07:35 srv-vme-ccs-02 corosync[1613]: [TOTEM ] Retransmit List: 118
Oct 30 10:07:35 srv-vme-ccs-02 corosync[1613]: [TOTEM ] FAILED TO RECEIVE
Oct 30 10:07:49 srv-vme-ccs-02 arpwatch: bogon 192.168.0.120 d4:be:d9:af:c6:23
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 232: memb=1, new=0, lost=1
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: memb: srv-vme-ccs-02 2561414316
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: lost: srv-vme-ccs-01 2544637100
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 232: memb=1, new=0, lost=0
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: MEMB: srv-vme-ccs-02 2561414316
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: ais_mark_unseen_peer_dead: Node srv-vme-ccs-01 was not seen in the previous transition
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: update_member: Node 2544637100/srv-vme-ccs-01 is now: lost
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: send_member_notification: Sending membership update 232 to 2 children
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [CPG ] chosen downlist: sender r(0) ip(172.20.172.152) ; members(old:2 left:1)
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 10:07:59 srv-vme-ccs-02 cib[1968]: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now lost (was member)
Oct 30 10:07:59 srv-vme-ccs-02 cib[1968]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now lost (was member)
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: warning: reap_dead_nodes: Our DC node (srv-vme-ccs-01) left the cluster

Now srv-vme-ccs-01 is no longer a member.
On the other node I found similar logs of failed retransmits.
Logs of Node 1 (srv-vme-ccs-01):
Oct 30 09:48:32 [2000] srv-vme-ccs-01 pengine: info: determine_online_status: Node srv-vme-ccs-01 is online
Oct 30 09:48:32 [2000] srv-vme-ccs-01 pengine: info: determine_online_status: Node srv-vme-ccs-02 is online
Oct 30 09:48:59 [2001] srv-vme-ccs-01 crmd: info: update_dc: Unset DC. Was srv-vme-ccs-01
Oct 30 09:48:59 corosync [TOTEM ] Retransmit List: 107 108 109 10a 10b 10c 10d 10e 10f 110 111 112 113 114 115 116 117
Oct 30 09:48:59 corosync [TOTEM ] Retransmit List: 107 108 109 10a 10b 10c 10d 10e 10f 110 111 112 113 114 115 116 117 118
Oct 30 10:08:22 corosync [TOTEM ] A processor failed, forming new configuration.
Oct 30 10:08:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 232: memb=1, new=0, lost=1
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: memb: srv-vme-ccs-01 2544637100
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: lost: srv-vme-ccs-02 2561414316
Oct 30 10:08:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 232: memb=1, new=0, lost=0
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: srv-vme-ccs-01 2544637100
Oct 30 10:08:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node srv-vme-ccs-02 was not seen in the previous transition
Oct 30 10:08:25 corosync [pcmk ] info: update_member: Node 2561414316/srv-vme-ccs-02 is now: lost
Oct 30 10:08:25 corosync [pcmk ] info: send_member_notification: Sending membership update 232 to 2 children
Oct 30 10:08:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-02[2561414316] - state is now lost (was member)
Oct 30 10:08:25 corosync [CPG ] chosen downlist: sender r(0) ip(172.20.172.151) ; members(old:2 left:1)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-02[2561414316] - state is now lost (was member)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: peer_update_callback: srv-vme-ccs-02 is now lost (was member)
Oct 30 10:08:25 corosync [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: warning: match_down_event: No match for shutdown action on srv-vme-ccs-02
Oct 30 10:08:25 [1990] srv-vme-ccs-01 pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=9): Try again (6)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: join_make_offer: Skipping srv-vme-ccs-01: already known 1
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: update_dc: Set DC to srv-vme-ccs-01 (3.0.7)
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: info: cib_process_request: Completed cib_modify operation for section crm_config: OK (rc=0, origin=local/crmd/185, version=0.116.3)
So at the same time, heavy message retransmission occurred on both nodes (it started after a server rebooted abruptly). Both nodes marked each other as lost members and each formed its own single-node cluster, electing itself as DC.
Here is how I resolved it.
First, as tcpdump showed, corosync (the cluster messaging layer under Pacemaker) was sending multicast traffic, and after investigating with the network team we learned that multicast is not enabled on our network.
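For anyone verifying the same thing, this is roughly the kind of capture we looked at (the interface eth0, multicast address 226.94.1.1, and port 5405 are assumptions here; substitute the interface and the mcastaddr/mcastport from your own corosync.conf):

    # corosync traffic on the cluster interface; the destination address shows
    # whether it is multicast (e.g. 226.x.x.x) or unicast (the peer's IP)
    tcpdump -n -i eth0 udp port 5405

    # only packets addressed to the (assumed) multicast group
    tcpdump -n -i eth0 host 226.94.1.1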
So we removed the mcastaddr entry and restarted corosync and pacemaker, but corosync refused to start, complaining that no mcastaddr was defined in corosync.conf.
Later, while debugging, I found that the syntax for

transport: udpu

was incorrect; it had been written as:

transport=udpu
Because that option was not picked up, corosync was running in its default multicast mode.
So the issue was resolved after correcting corosync.conf.
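For completeness, a minimal sketch of the corrected totem section (corosync 1.x style, matching these logs; bindnetaddr and the member addresses are assumptions based on the node IPs 172.20.172.151/152 seen above):

    totem {
        version: 2
        transport: udpu                  # colon syntax; with "transport=udpu" the option was
                                         # not applied and corosync fell back to multicast
        interface {
            ringnumber: 0
            bindnetaddr: 172.20.172.0    # assumed cluster network
            mcastport: 5405
            member {
                memberaddr: 172.20.172.151   # srv-vme-ccs-01
            }
            member {
                memberaddr: 172.20.172.152   # srv-vme-ccs-02
            }
        }
    }

With udpu, corosync talks to each listed member over unicast UDP, so cluster membership no longer depends on multicast being enabled in the network.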