activemq-artemis

How do I make a node rejoin a cluster on ActiveMQ Artemis?


I have an environment with 3 pairs of primary-backup servers.

After an incident, we restarted the broker according to this documentation.

It all went well for the first 2 primary servers, but the 3rd primary server won't join the cluster, and the following errors are flooding the logs:

On the primary node that is failing:

AMQ224135: nodeID cfe10fd7-3dcc-11ee-9a66-00505698d53b is closing. Topology update ignored

Here the nodeID is that of the failing node.

And on the other primary nodes:

AMQ212037: CORE connection failure to 10.21.118.71:38570 has been detected: AMQ229014: Did not receive data from 10.21.118.71:38570 within the 60000ms connection TTL

Here the IP is that of the failing node.

We also noticed the address activemq.notifications and the internal queues $.artemis.internal.sf.amq-cluster.<node-id> towards the failing node filling up with thousands of messages. The messages on activemq.notifications are consumed, but those on the internal queues are not.

I searched for the code AMQ224135 but it didn't yield any results.

We retried the start sequence once again, but the result and behaviour were the same.


Solution

  • Here is our solution to this problem.

    When restarting the 3rd primary broker, we noticed that messages were accumulating in the internal queue between the 2nd and the 3rd broker ($.artemis.internal.sf.amq-cluster.).

    When looking at these messages in the console, we found 1 business-related message being duplicated. This message had been sent via the console and was meant for one of our existing addresses, but it was apparently sent to the internal queue by mistake. In the console's message listing, we could see it repeated many times, each copy with the same messageId.

    Our solution was to purge the internal queue. As soon as we did that, the 3rd primary broker became available again and synced with its backup server instantly.
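
For anyone who would rather script the check and the purge than click through the console, here is a minimal Python sketch that only builds the Jolokia request bodies; it does not send anything. Several things here are assumptions, not something confirmed in the post: the broker name "0.0.0.0" (use the <name> from your broker.xml), the multicast routing type of the store-and-forward queue, and the idea that the console exposes Jolokia at a URL such as http://<host>:8161/console/jolokia. The helper names are made up for illustration. MessageCount and removeAllMessages() are the attribute and operation exposed by the queue's management control.

```python
import json
from collections import Counter


def queue_mbean(broker_name, address, queue, routing_type="multicast"):
    """Build the JMX ObjectName under which Artemis registers a queue."""
    return (
        f'org.apache.activemq.artemis:broker="{broker_name}",'
        f'component=addresses,address="{address}",'
        f'subcomponent=queues,routing-type="{routing_type}",'
        f'queue="{queue}"'
    )


def message_count_request(mbean):
    """Jolokia 'read' body for the queue's MessageCount attribute."""
    return {"type": "read", "mbean": mbean, "attribute": "MessageCount"}


def purge_request(mbean):
    """Jolokia 'exec' body that calls removeAllMessages() on the queue."""
    return {
        "type": "exec",
        "mbean": mbean,
        "operation": "removeAllMessages()",
        "arguments": [],
    }


def duplicated_message_ids(messages):
    """Return the messageID values that occur more than once in a listing."""
    counts = Counter(m["messageID"] for m in messages)
    return sorted(mid for mid, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Node ID taken from the AMQ224135 log line in the question.
    node_id = "cfe10fd7-3dcc-11ee-9a66-00505698d53b"
    sf_queue = f"$.artemis.internal.sf.amq-cluster.{node_id}"

    # Assumption: the address of a store-and-forward queue has the same name
    # as the queue itself, and the broker name is the hypothetical "0.0.0.0".
    mbean = queue_mbean("0.0.0.0", sf_queue, sf_queue)

    print(json.dumps(message_count_request(mbean), indent=2))
    print(json.dumps(purge_request(mbean), indent=2))
```

Each body would be POSTed as JSON to the Jolokia endpoint of the broker that holds the stuck queue (in our case the 2nd primary), with credentials that have management rights. Purging discards everything in that queue, so it is worth looking at the messages first; duplicated_message_ids() is a small helper for spotting a message that repeats under the same ID, which is what we saw.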