I am running ActiveMQ Artemis on Kubernetes and trying to configure high availability (HA) with shared storage. However, I am facing an issue where the primary pod goes into a restart loop after enabling the shared store HA policy.
My question is an extension of this one, as I am experiencing the same issue but have also experimented with an alternative setup.
What I Tried
Configured HA with shared store:
Primary Pod
<ha-policy>
   <shared-store>
      <primary>
         <failover-on-shutdown>true</failover-on-shutdown>
      </primary>
   </shared-store>
</ha-policy>
Secondary Pod
<ha-policy>
   <shared-store>
      <backup>
         <allow-failback>false</allow-failback>
         <failover-on-shutdown>true</failover-on-shutdown>
      </backup>
   </shared-store>
</ha-policy>
Observed Issue:
ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager lock, message=NULL
java.io.IOException: lost lock
What else I tried:
Tested running without an HA policy but in clustered mode:
Questions:
You see "Lost NodeManager lock" when using a shared-store ha-policy because that configuration causes the broker to actively monitor the shared file lock while the broker is running.
Without a shared-store ha-policy, your primary broker might lose the shared file lock without realizing it, in which case the backup would activate and both the primary and the backup would be operating simultaneously (i.e., split brain). Therefore, I would not recommend a simple clustered setup using shared storage without a shared-store ha-policy.
I recommend you inspect the configuration and features of the shared storage device to ensure it is able to support exclusive shared file locks. I also recommend you monitor the shared storage device to ensure there are no intermittent problems that would cause the primary broker to lose its lock.
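As a rough way to check whether the shared volume enforces exclusive locks, a small probe like the following could be run from both pods at the same time. This is only a sketch using the standard java.nio file-locking API, not the broker's own locking code; the class name LockProbe, the helper tryExclusive, and the probe file path passed as an argument are all hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LockProbe {

    /**
     * Tries to take an exclusive lock on the whole file.
     * Returns null if the lock is already held, either by another
     * process or by this JVM.
     */
    public static FileLock tryExclusive(FileChannel channel) throws IOException {
        try {
            // tryLock() returns null when another process holds the lock
            return channel.tryLock();
        } catch (OverlappingFileLockException e) {
            // thrown when this same JVM already holds an overlapping lock
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java LockProbe <probe-file-on-shared-volume>");
            return;
        }
        // Assumption: args[0] points at a scratch file on the shared volume,
        // not at the broker's own lock file.
        Path probe = Paths.get(args[0]);
        try (FileChannel channel = FileChannel.open(probe,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = tryExclusive(channel);
            if (lock == null) {
                System.out.println("Lock is held elsewhere; exclusive locking appears to be enforced");
                return;
            }
            System.out.println("Lock acquired; holding it for 60 seconds");
            Thread.sleep(60_000);
            lock.release();
            System.out.println("Lock released");
        }
    }
}
```

Start it on the primary pod first, then on the backup pod within the 60 seconds. If the second instance also reports "Lock acquired", the storage is probably not enforcing exclusive locks across nodes. The probe only tests acquisition; it does not detect a lock being lost on the storage side while it is held.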
You can enable TRACE logging for org.apache.activemq.artemis.core.server.impl.FileLockNodeManager to help you identify why the primary broker is losing its shared file lock.