We've set up a SQL Server 2017 availability group cluster on Linux following the Microsoft documentation. Replication within the AG works fine, but we are unable to fail over. If I watch the logs during a failover attempt, Pacemaker tries to move the AG, but the move fails and the AG keeps running on the primary.
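For reference, this is roughly how the failover is being triggered (pcs 0.9 syntax; the resource and node names are the ones that appear in the output further down):
# Ask Pacemaker to promote the AG resource on the secondary
pcs resource move ttsyncagresource-master syncdb01b-stag --master
# Afterwards, clear the location constraint the move leaves behind
pcs resource clear ttsyncagresource-master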
On the master, it reports the resource as not running:
Oct 01 15:06:56 [4346] syncdb01a-stag lrmd: notice: operation_finished: ttsyncagresource_monitor_11000:6280:stderr [ resource ttsyncagresource is NOT running ]
On the secondary I see this unknown error:
Oct 01 15:06:57 [24662] syncdb01b-stag pengine: warning: unpack_rsc_op_failure: Processing failed start of ttsyncagresource:1 on syncdb01b-stag: unknown error | rc=1
If I run pcs status, I get the following. The most recent error it shows is what happens when I shut down the primary node; the other two errors were caused by SQL Server permissions and have since been resolved (the fix is shown after the output).
[root@syncdb01a-stag oper]# pcs status
Cluster name: syncdb-stag
Stack: corosync
Current DC: syncdb01b-stag (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Tue Oct 1 20:36:32 2019
Last change: Tue Oct 1 15:53:57 2019 by root via crm_resource on syncdb01a-stag
3 nodes configured
3 resources configured
Online: [ syncdb01a-stag syncdb01b-stag syncwit01-stag ]
Full list of resources:
Master/Slave Set: ttsyncagresource-master [ttsyncagresource]
    Masters: [ syncdb01a-stag ]
    Stopped: [ syncdb01b-stag syncwit01-stag ]
Failed Resource Actions:
* ttsyncagresource_monitor_11000 on syncdb01a-stag 'not running' (7): call=17, status=complete, exitreason='',
last-rc-change='Tue Oct 1 15:03:47 2019', queued=0ms, exec=0ms
* ttsyncagresource_start_0 on syncdb01b-stag 'unknown error' (1): call=17, status=complete, exitreason='2019/10/01 14:43:30 Did not find AG row in sys.availability_groups',
last-rc-change='Tue Oct 1 14:43:25 2019', queued=0ms, exec=5255ms
* ttsyncagresource_start_0 on syncwit01-stag 'unknown error' (1): call=17, status=complete, exitreason='2019/10/01 14:43:30 Did not find AG row in sys.availability_groups',
last-rc-change='Tue Oct 1 14:43:25 2019', queued=1ms, exec=5228ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
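For context, the two 'Did not find AG row in sys.availability_groups' start failures above were the permission problem I mentioned. This is roughly what was run on each replica to fix it; the grants are the ones from the MS docs, and [pacemaker_login] is just a placeholder for our actual Pacemaker monitoring login:
# Confirm the AG row is visible on the replica (sqlcmd prompts for the password)
sqlcmd -S localhost -U sa -Q "SELECT name FROM sys.availability_groups"
# Grants the mssql ag resource agent needs, per the MS docs
sqlcmd -S localhost -U sa -Q "GRANT ALTER, CONTROL, VIEW DEFINITION ON AVAILABILITY GROUP::[ttsyncag] TO [pacemaker_login]; GRANT VIEW SERVER STATE TO [pacemaker_login];"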
I've also removed all constraints (we are not using a virtual IP resource because this is a multi-subnet setup):
[root@syncdb01a-stag oper]# pcs constraint
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
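For completeness, clearing the constraints was done along these lines (the IDs come from the full listing; the cli-prefer/cli-ban entries are the ones left behind by earlier move attempts):
# Show every constraint with its ID, then remove them one by one
pcs constraint list --full
pcs constraint remove <constraint-id>
# Also drop any move/ban constraints created by manual failover attempts
pcs resource clear ttsyncagresource-master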
This is the output of pcs config:
[root@syncdb01a-stag oper]# pcs config
Cluster Name: syncdb-stag
Corosync Nodes:
syncdb01a-stag syncdb01b-stag syncwit01-stag
Pacemaker Nodes:
syncdb01a-stag syncdb01b-stag syncwit01-stag
Resources:
Master: ttsyncagresource-master
  Meta Attrs: notify=true
  Resource: ttsyncagresource (class=ocf provider=mssql type=ag)
   Attributes: ag_name=ttsyncag
   Meta Attrs: failure=timeout=60s notify=true
   Operations: demote interval=0s timeout=10 (ttsyncagresource-demote-interval-0s)
               monitor interval=10 timeout=60 (ttsyncagresource-monitor-interval-10)
               monitor interval=11 role=Master timeout=60 (ttsyncagresource-monitor-interval-11)
               monitor interval=12 role=Slave timeout=60 (ttsyncagresource-monitor-interval-12)
               notify interval=0s timeout=60 (ttsyncagresource-notify-interval-0s)
               promote interval=0s timeout=60 (ttsyncagresource-promote-interval-0s)
               start interval=0s timeout=60 (ttsyncagresource-start-interval-0s)
               stop interval=0s timeout=10 (ttsyncagresource-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: syncdb-stag
cluster-recheck-interval: 2min
dc-version: 1.1.20-5.el7_7.1-3c4c782f70
have-watchdog: false
start-failure-is-fatal: true
stonith-enabled: false
Quorum:
Options:
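One note on the properties above: fencing is disabled in this staging cluster. That was set with the standard property; the MS docs only suggest doing this for test environments:
# Staging only: run the cluster without STONITH/fencing
pcs property set stonith-enabled=false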
I rebuilt the cluster from scratch and it worked fine. I'm not exactly sure where I went wrong, but this time I did a full system update before starting.
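For anyone who hits the same thing, the rebuild went roughly like this (pcs 0.9 syntax, same names as above); the only deliberate difference from the first attempt was updating the OS and the SQL Server HA packages before creating anything:
# Full system update first (this pulls in the latest mssql-server and mssql-server-ha packages)
yum update -y
# Recreate the cluster and the AG resource per the MS docs
pcs cluster auth syncdb01a-stag syncdb01b-stag syncwit01-stag -u hacluster
pcs cluster setup --name syncdb-stag syncdb01a-stag syncdb01b-stag syncwit01-stag
pcs cluster start --all
pcs resource create ttsyncagresource ocf:mssql:ag ag_name=ttsyncag meta failure-timeout=60s master notify=true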