Context
I'm trying to set up a working pair (single master + slave) of Artemis (v2.37) brokers on an AKS (k8s) cluster with persistent volumes (we use an Azure Storage account). We use KUBE_PING for address discovery.
We have been using the replication feature for several months, but the split-brain problem occurs too often. I want to switch to shared-store.
The current (not working) scenario
After my change to the shared-store solution I face a scenario with 4 steps:
Expected behaviour
Master works as master without restarting. Slave works as slave.
Debugging
I searched the master logs (I can't paste all 5.5k lines here) and found these entries just before the pod restarts:
2024-09-24 08:03:05,344 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lock appears to be valid; double check by reading status
2024-09-24 08:03:05,344 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] getting state...
2024-09-24 08:03:05,344 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 0
2024-09-24 08:03:05,350 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] locked position: 0
2024-09-24 08:03:05,350 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] lock: sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]
2024-09-24 08:03:05,355 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] state: L
2024-09-24 08:03:05,355 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lock appears to be valid; triple check by comparing timestamp
2024-09-24 08:03:05,357 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lock file /var/lib/artemis-instance/data/journal/server.lock originally locked at 2024-09-24T08:02:33.067+0000 was modified at 2024-09-24T08:02:35.181+0000
2024-09-24 08:03:05,358 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lost the lock according to the monitor, notifying listeners
2024-09-24 08:03:05,358 ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager lock, message=NULL
java.io.IOException: lost lock
In the meantime, there are errors related to the netty connection, which look more like a warning that the Artemis instance hasn't started yet. artemis.artemis.svc.cluster.local is the master pod's address (if I understand correctly, netty on the master asks itself whether it's working).
2024-09-24 08:03:02,454 ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
java.net.UnknownHostException: artemis.artemis.svc.cluster.local
Questions
What did I do wrong? Am I missing an important parameter? Maybe there is some timeout I should increase that I missed in the documentation?
With replication, the same configuration works (the master starts without a restart loop).
Configuration files
artemis-roles.properties: |
amq = admin
admin = admin,guest
artemis-users.properties: |
admin = admin
guest = guest
artemis.profile: |
ARTEMIS_HOME='/opt/artemis'
ARTEMIS_INSTANCE='/var/lib/artemis-instance'
ARTEMIS_DATA_DIR='/var/lib/artemis-instance/data'
ARTEMIS_ETC_DIR='/var/lib/artemis-instance/etc'
ARTEMIS_OOME_DUMP='/var/lib/artemis-instance/log/oom_dump.hprof'
ARTEMIS_INSTANCE_URI='file:/var/lib/artemis-instance/./'
ARTEMIS_INSTANCE_ETC_URI='file:/var/lib/artemis-instance/./etc/'
HAWTIO_ROLE='amq'
if [ -z "$JAVA_ARGS" ]; then
JAVA_ARGS="-XX:AutoBoxCacheMax=20000 -XX:+PrintClassHistogram -XX:+UseG1GC -XX:+UseStringDeduplication -Xms512M -Xmx2G -Dhawtio.disableProxy=true -Dhawtio.realm=activemq -Dhawtio.offline=true -Dhawtio.rolePrincipalClasses=org.apache.activemq.artemis.spi.core.security.jaas.RolePrincipal -Dhawtio.http.strictTransportSecurity=max-age=31536000;includeSubDomains;preload -Djolokia.policyLocation=${ARTEMIS_INSTANCE_ETC_URI}jolokia-access.xml -Dlog4j2.disableJmx=true "
fi
JAVA_ARGS="$JAVA_ARGS -Djava.net.preferIPv4Stack=true -Dipv4addr=$(hostname -i)"
if [ "$1" = "run" ]; then :
fi;
broker.xml: |
<configuration xmlns="urn:activemq"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xi="http://www.w3.org/2001/XInclude"
xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
<core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:activemq:core ">
<name>{{ include "artemis.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local</name>
<persistence-enabled>true</persistence-enabled>
<max-redelivery-records>1</max-redelivery-records>
<paging-directory>/var/lib/artemis-instance/data/paging</paging-directory>
<bindings-directory>/var/lib/artemis-instance/data/bindings</bindings-directory>
<large-messages-directory>/var/lib/artemis-instance/data/large-messages</large-messages-directory>
<id-cache-size xmlns="urn:activemq:core">20000</id-cache-size>
<disk-scan-period>5000</disk-scan-period>
<max-disk-usage>90</max-disk-usage>
<critical-analyzer>true</critical-analyzer>
<critical-analyzer-timeout>180000</critical-analyzer-timeout>
<critical-analyzer-check-period>60000</critical-analyzer-check-period>
<critical-analyzer-policy>SHUTDOWN</critical-analyzer-policy>
<page-sync-timeout>512000</page-sync-timeout>
<global-max-messages>-1</global-max-messages>
<journal-type>ASYNCIO</journal-type>
<journal-directory>/var/lib/artemis-instance/data/journal</journal-directory>
<journal-datasync>true</journal-datasync>
<journal-min-files>2</journal-min-files>
<journal-pool-files>10</journal-pool-files>
<journal-device-block-size>4096</journal-device-block-size>
<journal-file-size>10M</journal-file-size>
<journal-buffer-timeout>144000</journal-buffer-timeout>
<journal-max-io>4096</journal-max-io>
<xi:include href="/var/lib/artemis-instance/etc/acceptor.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/security-setting.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/cluster-connection.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/broadcast.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/address.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/address-setting.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/discovery.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/ha.xml"/>
<xi:include href="/var/lib/artemis-instance/etc/connector.xml"/>
</core>
</configuration>
acceptor.xml: |
<acceptors xmlns="urn:activemq:core">
<acceptor name="artemis">tcp://0.0.0.0:{{ .Values.conf.protocols.netty.port }}?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true;supportAdvisory=false;suppressInternalManagementObjects=false</acceptor>
{{ if .Values.conf.protocols.amqp.enabled }}
<acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
{{ end }}
{{ if .Values.conf.protocols.stomp.enabled }}
<acceptor name="stomp">tcp://0.0.0.0:{{ .Values.conf.protocols.stomp.port }}?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>
{{ end }}
{{ if .Values.conf.protocols.hornetq.enabled }}
<acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>
{{ end }}
{{ if .Values.conf.protocols.mqtt.enabled }}
<acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>
{{ end }}
{{ if .Values.conf.protocols.ws.enabled }}
<acceptor name="stomp-ws-acceptor">tcp://0.0.0.0:61614?protocols=STOMP_WS</acceptor>
{{ end }}
</acceptors>
ha.xml: |
<ha-policy xmlns="urn:activemq:core">
# <replication> when replication enabled
<shared-store>
{{ .Values.conf.broker.ha | indent 20 }}
</shared-store>
</ha-policy>
cluster-connection.xml: |
<cluster-connections xmlns="urn:activemq:core">
<cluster-connection name="artemis">
<address>jms</address>
<connector-ref>{{ include "artemis.fullname" . }}</connector-ref>
<check-period>1000</check-period>
<connection-ttl>5000</connection-ttl>
<min-large-message-size>50000</min-large-message-size>
<call-timeout>120000</call-timeout>
<retry-interval>500</retry-interval>
<retry-interval-multiplier>1.0</retry-interval-multiplier>
<max-retry-interval>5000</max-retry-interval>
<initial-connect-attempts>-1</initial-connect-attempts>
<reconnect-attempts>-1</reconnect-attempts>
<use-duplicate-detection>true</use-duplicate-detection>
<forward-when-no-consumers>false</forward-when-no-consumers>
<max-hops>1</max-hops>
<confirmation-window-size>10000000</confirmation-window-size>
<call-failover-timeout>30000</call-failover-timeout>
<notification-interval>1000</notification-interval>
<notification-attempts>2</notification-attempts>
<discovery-group-ref discovery-group-name="jgroups-discovery" />
</cluster-connection>
</cluster-connections>
address.xml: |
<addresses xmlns="urn:activemq:core">
<address name="DLQ">
<anycast>
<queue name="DLQ" />
</anycast>
</address>
<address name="ExpiryQueue">
<anycast>
<queue name="ExpiryQueue" />
</anycast>
</address>
</addresses>
address-setting.xml: |
<address-settings xmlns="urn:activemq:core">
<address-setting match="activemq.management#">
<dead-letter-address>DLQ</dead-letter-address>
<expiry-address>ExpiryQueue</expiry-address>
<redelivery-delay>0</redelivery-delay>
<max-size-bytes>-1</max-size-bytes>
<message-counter-history-day-limit>10</message-counter-history-day-limit>
<address-full-policy>PAGE</address-full-policy>
<auto-create-queues>true</auto-create-queues>
<auto-create-addresses>true</auto-create-addresses>
</address-setting>
<address-setting match="#">
<dead-letter-address>DLQ</dead-letter-address>
<expiry-address>ExpiryQueue</expiry-address>
<redelivery-delay>0</redelivery-delay>
<message-counter-history-day-limit>10</message-counter-history-day-limit>
<address-full-policy>PAGE</address-full-policy>
<auto-create-queues>true</auto-create-queues>
<auto-create-addresses>true</auto-create-addresses>
<auto-delete-queues>false</auto-delete-queues>
<auto-delete-addresses>false</auto-delete-addresses>
<page-size-bytes>10M</page-size-bytes>
<max-size-bytes>-1</max-size-bytes>
<max-size-messages>-1</max-size-messages>
<max-read-page-messages>-1</max-read-page-messages>
<max-read-page-bytes>20M</max-read-page-bytes>
<page-limit-bytes>-1</page-limit-bytes>
<page-limit-messages>-1</page-limit-messages>
</address-setting>
</address-settings>
broadcast.xml: |
<broadcast-groups xmlns="urn:activemq:core">
<broadcast-group name="jgroups-broadcast">
<jgroups-file>jgroups-discovery.xml</jgroups-file>
<jgroups-channel>activemq_broadcast_channel</jgroups-channel>
<connector-ref>{{ include "artemis.fullname" . }}</connector-ref>
</broadcast-group>
</broadcast-groups>
discovery.xml: |
<discovery-groups xmlns="urn:activemq:core" >
<discovery-group name="jgroups-discovery">
<jgroups-file>jgroups-discovery.xml</jgroups-file>
<jgroups-channel>activemq_broadcast_channel</jgroups-channel>
<refresh-timeout>30000</refresh-timeout>
</discovery-group>
</discovery-groups>
jgroups-discovery.xml: |
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="urn:org:jgroups"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
<TCP
external_addr="match-interface:eth0"
bind_addr="match-interface:eth0"
bind_port="7800"
thread_pool.min_threads="1"
/>
<org.jgroups.protocols.kubernetes.KUBE_PING
masterProtocol="https"
namespace="{{ .Release.Namespace }}"
labels="rl-type={{ .Values.conf.kubePing.name }}"
/>
<MERGE3 max_interval="30000" min_interval="10000"/>
<FD_SOCK start_port="9000"/>
<FD_ALL timeout="30000" interval="5000"/>
<VERIFY_SUSPECT timeout="1500"/>
<BARRIER />
<pbcast.NAKACK2
xmit_interval="500"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="2000"
xmit_table_max_compaction_time="30000"
use_mcast_xmit="false"
discard_delivered_msgs="true" />
<UNICAST3
xmit_table_num_rows="100"
xmit_table_msgs_per_row="1000"
xmit_table_max_compaction_time="30000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"/>
<MFC max_credits="2M" min_threshold="0.4"/>
<FRAG2 frag_size="60K"/>
<pbcast.STATE_TRANSFER/>
<COUNTER/>
</config>
log4j2.properties: |
monitorInterval = 5
rootLogger = {{ .Values.conf.log_level }}, console, log_file
logger.activemq.name=org.apache.activemq
logger.activemq.level={{ .Values.conf.log_level }}
logger.artemis_server.name=org.apache.activemq.artemis.core.server
logger.artemis_server.level={{ .Values.conf.log_level }}
logger.artemis_journal.name=org.apache.activemq.artemis.journal
logger.artemis_journal.level={{ .Values.conf.log_level }}
logger.artemis_utils.name=org.apache.activemq.artemis.utils
logger.artemis_utils.level={{ .Values.conf.log_level }}
logger.critical_analyzer.name=org.apache.activemq.artemis.utils.critical
logger.critical_analyzer.level={{ .Values.conf.log_level }}
logger.audit_base = OFF, audit_log_file
logger.audit_base.name = org.apache.activemq.audit.base
logger.audit_base.additivity = false
logger.audit_resource = OFF, audit_log_file
logger.audit_resource.name = org.apache.activemq.audit.resource
logger.audit_resource.additivity = false
logger.audit_message = OFF, audit_log_file
logger.audit_message.name = org.apache.activemq.audit.message
logger.audit_message.additivity = false
logger.jetty.name=org.eclipse.jetty
logger.jetty.level=WARN
logger.authentication_filter.name=io.hawt.web.auth.AuthenticationFilter
logger.authentication_filter.level=ERROR
logger.curator.name=org.apache.curator
logger.curator.level=WARN
logger.zookeeper.name=org.apache.zookeeper
logger.zookeeper.level=ERROR
appender.console.type=Console
appender.console.name=console
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=%d %-5level [%logger] %msg%n
appender.log_file.type = RollingFile
appender.log_file.name = log_file
appender.log_file.fileName = ${sys:artemis.instance}/log/artemis.log
appender.log_file.filePattern = ${sys:artemis.instance}/log/artemis.log.%d{yyyy-MM-dd}
appender.log_file.layout.type = PatternLayout
appender.log_file.layout.pattern = %d %-5level [%logger] %msg%n
appender.log_file.policies.type = Policies
appender.log_file.policies.cron.type = CronTriggeringPolicy
appender.log_file.policies.cron.schedule = 0 0 0 * * ?
appender.log_file.policies.cron.evaluateOnStartup = true
appender.audit_log_file.type = RollingFile
appender.audit_log_file.name = audit_log_file
appender.audit_log_file.fileName = ${sys:artemis.instance}/log/audit.log
appender.audit_log_file.filePattern = ${sys:artemis.instance}/log/audit.log.%d{yyyy-MM-dd}
appender.audit_log_file.layout.type = PatternLayout
appender.audit_log_file.layout.pattern = %d [AUDIT](%t) %msg%n
appender.audit_log_file.policies.type = Policies
appender.audit_log_file.policies.cron.type = CronTriggeringPolicy
appender.audit_log_file.policies.cron.schedule = 0 0 0 * * ?
appender.audit_log_file.policies.cron.evaluateOnStartup = true
management.xml: |
<management-context xmlns="http://activemq.apache.org/schema">
<authorisation>
<allowlist>
<entry domain="hawtio"/>
</allowlist>
<default-access>
<access method="list*" roles="amq"/>
<access method="get*" roles="amq"/>
<access method="is*" roles="amq"/>
<access method="set*" roles="amq"/>
<access method="browse*" roles="amq"/>
<access method="count*" roles="amq"/>
<access method="*" roles="amq"/>
</default-access>
<role-access>
<match domain="org.apache.activemq.artemis">
<access method="list*" roles="amq"/>
<access method="get*" roles="amq"/>
<access method="is*" roles="amq"/>
<access method="set*" roles="amq"/>
<access method="browse*" roles="amq"/>
<access method="count*" roles="amq"/>
<access method="*" roles="amq"/>
</match>
</role-access>
</authorisation>
</management-context>
bootstrap.xml: |
{{ if .Values.conf.protocols.http.enabled }}
<broker xmlns="http://activemq.apache.org/schema">
<jaas-security domain="activemq"/>
<server configuration="file:/var/lib/artemis-instance/etc/broker.xml"/>
<web path="web" rootRedirectLocation="console">
<binding name="artemis" uri="http://0.0.0.0:{{ .Values.conf.protocols.http.port }}">
<app name="branding" url="activemq-branding" war="activemq-branding.war"/>
<app name="plugin" url="artemis-plugin" war="artemis-plugin.war"/>
<app name="console" url="console" war="console.war"/>
</binding>
</web>
</broker>
{{ end }}
jolokia-access.xml: |
<restrict>
<cors>
<allow-origin>*://*</allow-origin>
<strict-checking/>
</cors>
</restrict>
login.config: |
activemq {
org.apache.activemq.artemis.spi.core.security.jaas.PropertiesLoginModule sufficient
debug=false
reload=true
org.apache.activemq.jaas.properties.user="artemis-users.properties"
org.apache.activemq.jaas.properties.role="artemis-roles.properties";
org.apache.activemq.artemis.spi.core.security.jaas.GuestLoginModule sufficient
debug=false
org.apache.activemq.jaas.guest.user="amq"
org.apache.activemq.jaas.guest.role="amq";
};
security-setting.xml: |
<security-settings xmlns="urn:activemq:core">
<security-setting match="#">
<permission type="createNonDurableQueue" roles="amq"/>
<permission type="deleteNonDurableQueue" roles="amq"/>
<permission type="createDurableQueue" roles="amq"/>
<permission type="deleteDurableQueue" roles="amq"/>
<permission type="createAddress" roles="amq"/>
<permission type="deleteAddress" roles="amq"/>
<permission type="consume" roles="amq"/>
<permission type="browse" roles="amq"/>
<permission type="send" roles="amq"/>
<permission type="manage" roles="amq"/>
</security-setting>
</security-settings>
connector.xml: |
<connectors xmlns="urn:activemq:core">
<connector name="{{ include "artemis.fullname" . }}">tcp://{{ include "artemis.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.conf.protocols.netty.port }}</connector>
</connectors>
shared-store HA block specific for master:
<primary>
<failover-on-shutdown>true</failover-on-shutdown>
<wait-for-activation>false</wait-for-activation>
</primary>
shared-store HA block specific for slave:
<backup>
<failover-on-shutdown>true</failover-on-shutdown>
<allow-failback>true</allow-failback>
</backup>
Replication HA block for master (used before shared-store change):
<primary>
<check-for-active-server>true</check-for-active-server>
<initial-replication-sync-timeout>600</initial-replication-sync-timeout>
</primary>
Replication HA block for slave (used before shared-store change):
<backup>
<allow-failback>true</allow-failback>
</backup>
I googled for similar issues.
The fact that you're seeing this error:
ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager lock, message=NULL
java.io.IOException: lost lock
indicates that the shared storage device/protocol that you're using doesn't support the proper file locking semantics or perhaps file locking is not configured properly for the mount.
What's happening is that the primary broker starts and acquires a lock on the shared journal. When the backup broker starts, it appears that it is also able to acquire the lock on the shared journal. When the backup then modifies a file that should be locked by the primary, the primary detects this and shuts itself down to avoid split brain.
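To make the evidence concrete, here is a small Python sketch that reads the "originally locked at ... was modified at ..." DEBUG line from your log and computes how long after the lock was taken the file was touched. The function name and the parsing approach are my own for illustration. This is not how Artemis itself performs the check, only a way to read the message.

```python
from datetime import datetime

# DEBUG line copied from the question's log (logger prefix omitted).
LOG_LINE = (
    "Lock file /var/lib/artemis-instance/data/journal/server.lock "
    "originally locked at 2024-09-24T08:02:33.067+0000 "
    "was modified at 2024-09-24T08:02:35.181+0000"
)

# Timestamp layout as it appears in the log line above.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def lock_drift_seconds(line: str) -> float:
    """Seconds between the recorded lock time and the later modification."""
    locked_raw = line.split("originally locked at ")[1].split(" ")[0]
    modified_raw = line.split("was modified at ")[1].split(" ")[0]
    locked = datetime.strptime(locked_raw, TS_FORMAT)
    modified = datetime.strptime(modified_raw, TS_FORMAT)
    return (modified - locked).total_seconds()


if __name__ == "__main__":
    print(f"lock file modified {lock_drift_seconds(LOG_LINE):.3f}s after it was locked")
```

A positive value means something wrote to server.lock after the primary believed it held the lock exclusively, which is consistent with a second process getting past the lock.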
I recommend you investigate the storage device/protocol you're using and ensure it supports exclusive file locking across the network and that such locking is properly configured.
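As a starting point for that investigation, a rough probe like the following may help. It is a sketch, not an official Artemis tool: it holds an exclusive POSIX lock (via Python's `fcntl.lockf`) on a file and then checks whether a second process is refused the same lock. The function name, the child script, and the default temp-file path are all hypothetical. I'm also assuming that `fcntl`-style locks are representative of what the broker relies on for your mount.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child process: attempt a non-blocking exclusive lock and report the outcome.
CHILD = """
import fcntl, sys
with open(sys.argv[1], "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("acquired")
    except OSError:
        print("blocked")
"""


def second_process_is_blocked(lock_path: str) -> bool:
    """Hold an exclusive lock, then check that another process cannot take it."""
    with open(lock_path, "a+b") as holder:
        # Take the lock in this process; it is released when the file closes.
        fcntl.lockf(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
        result = subprocess.run(
            [sys.executable, "-c", CHILD, lock_path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip() == "blocked"


if __name__ == "__main__":
    # Hypothetical default path; pass a file on the shared mount instead.
    default = os.path.join(tempfile.gettempdir(), "lock-probe.tmp")
    path = sys.argv[1] if len(sys.argv) > 1 else default
    print("exclusive locking enforced:", second_process_is_blocked(path))
```

Note that this runs both processes on the same host, so it only shows whether the mount honours locks locally. For a meaningful result on shared storage, the holder and the probing process should run in two different pods that mount the same volume, mirroring the primary and the backup.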