We have a ReliableMessageListener that synchronizes some data structures it holds across the cluster through its onMessage implementation.
The cluster is composed of three nodes. We noticed that one of the topics got out of sync and its listener was terminated due to message loss detected by the ring buffer, as we get a "Terminating MessageListener, ... Reason: Underlying ring buffer data related to reliable topic is lost" exception. The node is still up, but this specific listener no longer receives events/messages from the other two nodes, while they still receive messages from it.
We end up with a de facto split-brain for this specific topic.
Our message listener is configured with isLossTolerant = false and isTerminal = false.
I am trying to understand what is considered a good strategy for handling such a scenario and recovering from it.
For example, is it good practice to subscribe to this topic again? Is it good practice to send a message that clears the data on the other nodes in the cluster? Will they even receive that message after the ring buffer got out of sync?
Thanks
The message "Reason: Underlying ring buffer data related to the reliable topic is lost" means that the data you are trying to read is no longer available because it was overwritten by newer data in the underlying Ringbuffer. Your producer is likely faster than your consumers.
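To make the overwrite mechanism concrete, here is a minimal, self-contained sketch of a fixed-capacity ring buffer addressed by sequence numbers. It is not Hazelcast's implementation; the class TinyRingbuffer and the method slowReaderLosesData are hypothetical names used only for illustration. It shows how a reader whose next sequence has fallen behind the oldest retained sequence can no longer be served.

```java
public class RingbufferLossDemo {

    /** Hypothetical fixed-capacity ring buffer addressed by ever-increasing sequences. */
    static final class TinyRingbuffer {
        private final String[] items;
        private long tailSequence = -1; // sequence of the newest item, -1 when empty

        TinyRingbuffer(int capacity) {
            items = new String[capacity];
        }

        /** Stores the item, silently overwriting the oldest slot once the buffer is full. */
        long add(String item) {
            tailSequence++;
            items[(int) (tailSequence % items.length)] = item;
            return tailSequence;
        }

        /** Oldest sequence that is still stored. */
        long headSequence() {
            return Math.max(0, tailSequence - items.length + 1);
        }

        String read(long sequence) {
            if (sequence < headSequence()) {
                // The slot has been reused by a newer item: the requested data is gone.
                throw new IllegalStateException(
                        "sequence " + sequence + " is lost; head is " + headSequence());
            }
            if (sequence > tailSequence) {
                throw new IllegalArgumentException("sequence " + sequence + " not written yet");
            }
            return items[(int) (sequence % items.length)];
        }
    }

    /** Returns true when the reader's next sequence has already been overwritten. */
    static boolean slowReaderLosesData(int capacity, int published, long readerSequence) {
        TinyRingbuffer ringbuffer = new TinyRingbuffer(capacity);
        for (int i = 0; i < published; i++) {
            ringbuffer.add("msg-" + i);
        }
        try {
            ringbuffer.read(readerSequence);
            return false;
        } catch (IllegalStateException lost) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Capacity 4 with 10 messages published: only sequences 6..9 are retained.
        System.out.println(slowReaderLosesData(4, 10, 2)); // reader stuck at 2: data lost
        System.out.println(slowReaderLosesData(4, 10, 7)); // reader at 7: still readable
    }
}
```

With a capacity of 4 and 10 published messages, a reader still waiting on sequence 2 has lost its data, while a reader at sequence 7 can continue normally.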
When such a situation occurs the ReliableTopic is still usable, and you can register a new listener.
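Registering a new listener could look roughly like the sketch below. It assumes Hazelcast 4.x or 5.x on the classpath (in 3.x the package names and the registration id type differ). The class names ResubscribeSketch and SyncListener and the helper resubscribe are hypothetical, and how your application notices that the old listener was terminated is left open here.

```java
import com.hazelcast.topic.ITopic;
import com.hazelcast.topic.Message;
import com.hazelcast.topic.ReliableMessageListener;

import java.util.UUID;

public class ResubscribeSketch {

    /** Hypothetical listener mirroring the settings described in the question. */
    static final class SyncListener implements ReliableMessageListener<String> {
        private volatile long lastSequence = -1;

        @Override
        public long retrieveInitialSequence() {
            // -1 means: no stored position, start from the next published message.
            return -1;
        }

        @Override
        public void storeSequence(long sequence) {
            lastSequence = sequence;
        }

        @Override
        public boolean isLossTolerant() {
            // false: the listener is terminated when ring buffer data is lost.
            return false;
        }

        @Override
        public boolean isTerminal(Throwable failure) {
            return false;
        }

        @Override
        public void onMessage(Message<String> message) {
            // Apply the update to the local data structures (application-specific).
        }

        long lastSequence() {
            return lastSequence;
        }
    }

    /** Drops the stale registration (if any) and registers a fresh listener. */
    static UUID resubscribe(ITopic<String> topic, UUID staleRegistration) {
        if (staleRegistration != null) {
            topic.removeMessageListener(staleRegistration);
        }
        return topic.addMessageListener(new SyncListener());
    }
}
```

Note that a fresh listener starting at the next published message does not replay what was lost, so the local data probably has to be rebuilt by some application-level means before relying on it again. Returning true from isLossTolerant is another option: the listener then continues from the oldest available item instead of being terminated.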
To prevent the situation from occurring, you can either increase the size of the underlying ringbuffer (provide a ringbuffer config with the same name as the reliable topic) or configure a TopicOverloadPolicy. See the documentation for details.
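Both options can be expressed in programmatic configuration. The following is only a sketch, assuming Hazelcast 4.x or 5.x; the class name ReliableTopicConfigSketch, the capacity, and the time-to-live are placeholder choices, not recommendations. As far as I understand, the overload policy only comes into play when the ringbuffer has a time-to-live set, so the sketch sets one; please verify this against the documentation for your version.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.ReliableTopicConfig;
import com.hazelcast.config.RingbufferConfig;
import com.hazelcast.topic.TopicOverloadPolicy;

public class ReliableTopicConfigSketch {

    static Config build(String topicName) {
        Config config = new Config();

        // The ringbuffer config must use the same name as the reliable topic.
        config.addRingBufferConfig(new RingbufferConfig(topicName)
                .setCapacity(100_000)        // placeholder: larger than the default
                .setTimeToLiveSeconds(60));  // placeholder: assumption, see lead-in

        // BLOCK makes publishers wait for space instead of overwriting unread items.
        config.addReliableTopicConfig(new ReliableTopicConfig(topicName)
                .setTopicOverloadPolicy(TopicOverloadPolicy.BLOCK));

        return config;
    }
}
```

A larger capacity gives slow consumers more room to catch up, while BLOCK trades publisher latency for not overwriting data that has not expired yet. Which trade-off fits depends on how bursty the producers are.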