kubernetes activemq-artemis

ActiveMQ Artemis: Primary Pod Restart Loop with Shared Store HA


I am running ActiveMQ Artemis on Kubernetes and trying to configure high availability (HA) with shared storage. However, I am facing an issue where the primary pod goes into a restart loop after enabling the shared store HA policy.

My question is an extension of this one, as I am experiencing the same issue but have also experimented with an alternative setup.

What I Tried

Configured HA with shared store:

Primary Pod

<ha-policy>
    <shared-store>
        <primary>
            <failover-on-shutdown>true</failover-on-shutdown>
        </primary>
    </shared-store>
</ha-policy>

Secondary Pod

<ha-policy>
    <shared-store>
        <backup>
            <allow-failback>false</allow-failback>
            <failover-on-shutdown>true</failover-on-shutdown>
        </backup>
    </shared-store>
</ha-policy>

Observed Issue:

ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager lock, message=NULL
java.io.IOException: lost lock

What I changed:

Tested running without an HA policy but in clustered mode:

Questions:

  1. Why does the shared store HA setup cause the "Lost NodeManager lock" error, but a simple clustered setup with shared storage works fine?
  2. If I continue using a clustered setup without an HA policy but with shared storage, is this an acceptable and recommended approach?
  3. What are the risks of running a clustered ActiveMQ Artemis setup with shared storage but without an HA policy?

Solution

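  • You see "Lost NodeManager lock" when using a shared-store ha-policy because that configuration causes the broker to actively monitor the shared file lock while the broker is running. When the broker detects that the lock is gone, it treats this as a critical IO error and shuts itself down, which is what the AMQ222010 message in your log reports.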

    Without a shared-store ha-policy, your primary broker might lose the shared file lock without realizing it. In that case the backup would activate, and both the primary and the backup would be operating simultaneously (i.e., split brain). Therefore, I would not recommend a simple clustered setup using shared storage without a shared-store ha-policy.

    I recommend you inspect the configuration and features of the shared storage device to ensure it is able to support exclusive shared file locks. I also recommend you monitor the shared storage device to ensure there are no intermittent problems that would cause the primary broker to lose its lock.

    You can enable TRACE logging for org.apache.activemq.artemis.core.server.impl.FileLockNodeManager to help you identify why the primary broker is losing its shared file lock.
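
As a rough way to check the recommendation above about exclusive file locks, the following sketch tries to take an exclusive lock on a file in the shared directory using the standard java.nio API. It is not Artemis code and does not reproduce how FileLockNodeManager monitors its lock; the class name SharedLockProbe, the helper names, and the probe file path are all hypothetical. The idea is to run it in one pod so it holds the lock, then run it in a second pod against the same file: if the storage supports exclusive locks, the second run should be refused.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical probe, not part of Artemis: checks whether an exclusive
// file lock on shared storage is actually exclusive across processes.
public class SharedLockProbe {

    /** Holds an exclusive lock; closing it releases the lock and the channel. */
    static final class HeldLock implements AutoCloseable {
        private final FileChannel channel;
        private final FileLock lock;

        HeldLock(FileChannel channel, FileLock lock) {
            this.channel = channel;
            this.lock = lock;
        }

        /** True while the JVM still considers the lock held. */
        boolean isValid() {
            return lock.isValid();
        }

        @Override
        public void close() throws IOException {
            lock.release();   // release first; the channel must still be open
            channel.close();
        }
    }

    /**
     * Tries to take an exclusive lock on the given file, creating it if needed.
     * Returns null if the lock is already held by another process
     * (tryLock returns null) or by this JVM (OverlappingFileLockException).
     */
    static HeldLock tryAcquire(Path probeFile) throws IOException {
        FileChannel channel = FileChannel.open(
                probeFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                channel.close();
                return null;
            }
            return new HeldLock(channel, lock);
        } catch (OverlappingFileLockException e) {
            channel.close();
            return null;
        } catch (IOException e) {
            channel.close();
            throw e;
        }
    }

    // Usage: java SharedLockProbe <probe-file-on-shared-storage> [hold-millis]
    public static void main(String[] args) throws Exception {
        Path probe = Path.of(args[0]);
        long holdMillis = args.length > 1 ? Long.parseLong(args[1]) : 60_000L;

        try (HeldLock held = tryAcquire(probe)) {
            if (held == null) {
                System.out.println("Lock is held elsewhere (expected for the second pod).");
                return;
            }
            System.out.println("Lock acquired; holding for " + holdMillis + " ms.");
            Thread.sleep(holdMillis);
            System.out.println("Lock still valid according to the JVM: " + held.isValid());
        }
    }
}
```

If the second pod also reports that it acquired the lock while the first is still holding it, the storage is not enforcing exclusive locks and is unlikely to be safe for shared-store HA. A clean result here does not rule out intermittent lock loss, so the monitoring and TRACE logging suggested above still apply.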