There is a problem when using ActiveMQ with a large number of persistent queues (250), each holding 1000 persistent TextMessages of 10 KB each.
Our scenario requires these messages to remain in storage for a long time (days) until they are consumed: large amounts of data are staged for distribution to many consumers, which may be offline for several days.
After the persistence store has been filled with these messages and the broker has been restarted, we can browse/consume some queues until the #checkpoint call runs after 30 seconds.
This call causes the broker to use all available memory and never release it for other tasks such as queue browsing/consuming. Internally, the MessageCursor seems to decide that there is not enough memory and stops delivering queue content to browsers/consumers.
=> Is there a way to avoid this behaviour by configuration, or is this a bug?
The expectation is that we can consume/browse any queue under all circumstances.
The settings below have been in production for some time now, and several recommendations found in the ActiveMQ documentation have been applied (destination policies, systemUsage, persistence store options, etc.).
Besides the settings mentioned above, we use the following broker configuration (by the way: changing the memoryLimit to a lower value such as 1mb does not change the situation):
<destinationPolicy>
    <policyMap>
        <policyEntries>
            <policyEntry queue=">" producerFlowControl="false" optimizedDispatch="true" memoryLimit="128mb" timeBeforeDispatchStarts="1000">
                <dispatchPolicy>
                    <strictOrderDispatchPolicy />
                </dispatchPolicy>
                <pendingQueuePolicy>
                    <storeCursor />
                </pendingQueuePolicy>
            </policyEntry>
        </policyEntries>
    </policyMap>
</destinationPolicy>
<systemUsage>
    <systemUsage sendFailIfNoSpace="true">
        <memoryUsage>
            <memoryUsage limit="500 mb" />
        </memoryUsage>
        <storeUsage>
            <storeUsage limit="80000 mb" />
        </storeUsage>
        <tempUsage>
            <tempUsage limit="1000 mb" />
        </tempUsage>
    </systemUsage>
</systemUsage>
Setting cursorMemoryHighWaterMark in the destinationPolicy to a higher value such as 150 or 600 (depending on the difference between memoryUsage and the available heap space) relieves the situation a bit as a workaround, but in my view this is not really an option for production systems.
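For reference, a sketch of what this workaround looks like on the policyEntry; the value 150 is just one of the values we tried, and cursorMemoryHighWaterMark is a percentage of the configured memoryLimit:

<policyEntry queue=">" producerFlowControl="false" optimizedDispatch="true"
             memoryLimit="128mb" timeBeforeDispatchStarts="1000"
             cursorMemoryHighWaterMark="150">
    <!-- workaround sketch only: lets the store cursor use more than the
         configured memoryLimit (the attribute is a percentage of memoryLimit) -->
    <dispatchPolicy>
        <strictOrderDispatchPolicy />
    </dispatchPolicy>
    <pendingQueuePolicy>
        <storeCursor />
    </pendingQueuePolicy>
</policyEntry>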
Screenshot with information from Oracle Mission Control showing the ActiveMQTextMessage instances that are never released from memory:
We solved our problem by changing the (queue) destination policyEntry.
After a thorough investigation (without changing the ActiveMQ source code), the result for now is that we have to accept the limitations imposed by the single memoryLimit parameter, which is used both for the #checkpoint/cleanup process and for browsing/consuming queues.
1.) Memory
There is no problem if we use a much higher memoryLimit (together with a higher max-heap) to support both the per-destination message caching during the #checkpoint/cleanup workflow and our requirement to browse/consume messages.
But more memory is not an option in our scenario; we have to work with a 1024m max-heap and a 500m memoryLimit.
Besides this, constantly raising the memoryLimit just because there are more persistent queues holding hundreds/thousands of pending messages, combined with offline/inactive consumer scenarios, is something that should be discussed in detail (IMHO).
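Purely to illustrate the order of magnitude: 250 queues × 1000 messages × 10 KB is roughly 2.4 GB of pending payload, so "much higher" would mean a memoryUsage limit (and heap) in that range. The numbers in the following sketch are assumptions for illustration only, not a recommendation:

<systemUsage>
    <systemUsage sendFailIfNoSpace="true">
        <memoryUsage>
            <!-- illustrative sizing only: large enough to cache the ~2.4 GB of
                 pending messages; the JVM max-heap would have to be raised well
                 beyond this value as well -->
            <memoryUsage limit="3000 mb" />
        </memoryUsage>
        <storeUsage>
            <storeUsage limit="80000 mb" />
        </storeUsage>
        <tempUsage>
            <tempUsage limit="1000 mb" />
        </tempUsage>
    </systemUsage>
</systemUsage>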
2.) Persistent Adapters
We ruled out persistence adapters as the cause of the problem, because the behaviour does not change if we switch between different types of persistence stores (KahaDB, LevelDB, JDBC-PostgreSQL).
During the debugging sessions with KahaDB we also saw regular checkpoint handling; the storage is managed as expected.
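For completeness, the KahaDB variant we compared against is essentially the stock persistenceAdapter element; the directory shown here is only illustrative:

<persistenceAdapter>
    <!-- KahaDB store used in one of the comparison runs; directory is illustrative -->
    <kahaDB directory="${activemq.data}/kahadb" />
</persistenceAdapter>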
3.) Destination Policy / Expiration Check
Our problem disappears completely if we disable caching and the expiration check, which turned out to be the actual cause of the problem.
The corresponding properties are documented, and there is a nice blog article about Message Priorities with a description that fits our scenario quite well:
We simply added useCache="false" and expireMessagesPeriod="0" to the policyEntry:
<destinationPolicy>
    <policyMap>
        <policyEntries>
            <policyEntry queue=">" producerFlowControl="false" optimizedDispatch="true" memoryLimit="128mb" timeBeforeDispatchStarts="1000"
                         useCache="false" expireMessagesPeriod="0">
                <dispatchPolicy>
                    <strictOrderDispatchPolicy />
                </dispatchPolicy>
                <pendingQueuePolicy>
                    <storeCursor />
                </pendingQueuePolicy>
            </policyEntry>
        </policyEntries>
    </policyMap>
</destinationPolicy>
The consequences are clear: we no longer use in-memory caching, and messages are never checked for expiration.
Since we use neither message expiration nor message priorities, and the current message dispatching is fast enough for us, this trade-off is acceptable given the system limitations.
One should also think about well-defined prefetch limits to control memory consumption during specific workflows. Message sizes in our scenario range from 2 bytes up to approx. 100 KB, so more individual policyEntries and client consumer configurations could help optimize the system's performance and memory usage (see http://activemq.apache.org/per-destination-policies.html); a sketch of such a per-destination entry follows below.
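As a sketch of what a more specific entry could look like (the queue name pattern "bulk.>" and the prefetch value are made up for illustration; queuePrefetch sets the broker-side default prefetch for matching queues):

<policyEntry queue="bulk.>" producerFlowControl="false" optimizedDispatch="true"
             memoryLimit="128mb" useCache="false" expireMessagesPeriod="0"
             queuePrefetch="10">
    <!-- a smaller prefetch for queues carrying large messages keeps the amount of
         data dispatched to (possibly slow or offline) consumers bounded -->
    <pendingQueuePolicy>
        <storeCursor />
    </pendingQueuePolicy>
</policyEntry>

On the client side the same limit can also be requested per destination via the consumer.prefetchSize destination option, or per connection via jms.prefetchPolicy.queuePrefetch in the broker URL.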