I am working on configuring an ActiveMQ Artemis cluster to serve around 100k connections. My cluster consists of 6 brokers, with the config copied below. All brokers run on dedicated VMs (14 CPUs, 20 GB memory each).
I am evaluating HiveMQ Enterprise and ActiveMQ Artemis to find out which one suits our expectations. HiveMQ Swarm is being used to create connections to the broker cluster. I am getting around 50k connections in 8 minutes, which is far below my expectation and also below what HiveMQ achieved. What can be changed to make the cluster more performant? On the client side the error message is "connection timed out", while the broker logs show this error:
java.lang.IllegalStateException: AMQ850000: Unable to store MQTT state within given timeout: 5000ms
at org.apache.activemq.artemis.core.protocol.mqtt.MQTTStateManager.storeSessionState(MQTTStateManager.java:177) ~[artemis-mqtt-protocol-2.31.0.jar:2.31.0]
at org.apache.activemq.artemis.core.protocol.mqtt.MQTTSubscriptionManager.removeSubscriptions(MQTTSubscriptionManager.java:291) ~[artemis-mqtt-protocol-2.31.0.jar:2.31.0]
at org.apache.activemq.artemis.core.protocol.mqtt.MQTTSubscriptionManager.clean(MQTTSubscriptionManager.java:368) ~[artemis-mqtt-protocol-2.31.0.jar:2.31.0]
<?xml version='1.0'?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<configuration xmlns="urn:activemq"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xi="http://www.w3.org/2001/XInclude"
xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
<core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:activemq:core ">
<name>0.0.0.0</name>
<persistence-enabled>true</persistence-enabled>
<!-- It is recommended to keep this value as 1, minimizing the number of records stored about redeliveries.
However if you must preserve state of individual redeliveries, you may increase this value or set it to -1 (infinite). -->
<max-redelivery-records>1</max-redelivery-records>
<!-- this could be ASYNCIO, MAPPED, NIO
ASYNCIO: Linux Libaio
MAPPED: mmap files
NIO: Plain Java Files
-->
<journal-type>ASYNCIO</journal-type>
<paging-directory>data/paging</paging-directory>
<bindings-directory>data/bindings</bindings-directory>
<journal-directory>data/journal</journal-directory>
<large-messages-directory>data/large-messages</large-messages-directory>
<!-- if you want to retain your journal uncomment this following configuration.
This will allow your system to keep 7 days of your data, up to 10G. Tweak it accordingly to your use case and capacity.
it is recommended to use a separate storage unit from the journal for performance considerations.
<journal-retention-directory period="7" unit="DAYS" storage-limit="10G">data/retention</journal-retention-directory>
You can also enable retention by using the argument journal-retention on the `artemis create` command -->
<journal-datasync>true</journal-datasync>
<journal-min-files>2</journal-min-files>
<journal-pool-files>10</journal-pool-files>
<!--<thread-pool-max-size>40</thread-pool-max-size>-->
<journal-device-block-size>8192</journal-device-block-size>
<journal-file-size>100M</journal-file-size>
<!--
This value was determined through a calculation.
Your system could perform 19.23 writes per millisecond
on the current journal configuration.
That translates as a sync write every 52000 nanoseconds.
Note: If you specify 0 the system will perform writes directly to the disk.
We recommend this to be 0 if you are using journalType=MAPPED and journal-datasync=false.
-->
<journal-buffer-timeout>52000</journal-buffer-timeout>
<!--
When using ASYNCIO, this will determine the writing queue depth for libaio.
-->
<journal-max-io>4096</journal-max-io>
<!--
You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
<network-check-NIC>theNicName</network-check-NIC>
-->
<!--
Use this to use an HTTP server to validate the network
<network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
<!-- <network-check-period>10000</network-check-period> -->
<!-- <network-check-timeout>1000</network-check-timeout> -->
<!-- this is a comma separated list, no spaces, just DNS or IPs
it should accept IPV6
Warning: Make sure you understand your network topology as this is meant to validate if your network is valid.
Using IPs that could eventually disappear or be partially visible may defeat the purpose.
You can use a list of multiple IPs; any successful ping will allow the server to continue running -->
<!-- <network-check-list>10.0.0.1</network-check-list> -->
<!-- use this to customize the ping used for ipv4 addresses -->
<!-- <network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command> -->
<!-- use this to customize the ping used for ipv6 addresses -->
<!-- <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command> -->
<!-- how often (in ms) we check how many bytes are being used on the disk -->
<disk-scan-period>5000</disk-scan-period>
<!-- once the disk hits this limit the system will block, or close the connection in certain protocols
that won't support flow control. -->
<max-disk-usage>90</max-disk-usage>
<!-- should the broker detect dead locks and other issues -->
<critical-analyzer>true</critical-analyzer>
<critical-analyzer-timeout>120000</critical-analyzer-timeout>
<critical-analyzer-check-period>60000</critical-analyzer-check-period>
<critical-analyzer-policy>HALT</critical-analyzer-policy>
<page-sync-timeout>288000</page-sync-timeout>
<!-- the system will enter into page mode once you hit this limit. This is an estimate in bytes of how much the messages are using in memory
The system will use half of the available memory (-Xmx) by default for the global-max-size.
You may specify a different value here if you need to customize it to your needs.
<global-max-size>100Mb</global-max-size> -->
<!-- the maximum number of messages accepted before entering full address mode.
if global-max-size is specified, full address mode will be triggered by whichever limit is hit first. -->
<global-max-messages>-1</global-max-messages>
<acceptors>
<!-- useEpoll means: it will use Netty epoll if you are on a system (Linux) that supports it -->
<!-- amqpCredits: The number of credits sent to AMQP producers -->
<!-- amqpLowCredits: The server will send the # credits specified at amqpCredits at this low mark -->
<!-- amqpDuplicateDetection: If you are not using duplicate detection, set this to false
as duplicate detection requires applicationProperties to be parsed on the server. -->
<!-- amqpMinLargeMessageSize: Determines how many bytes are considered large, so we start using files to hold their data.
default: 102400, -1 would mean to disable large message control -->
<!-- Note: If an acceptor needs to be compatible with HornetQ and/or Artemis 1.x clients add
"anycastPrefix=jms.queue.;multicastPrefix=jms.topic." to the acceptor url.
See https://issues.apache.org/jira/browse/ARTEMIS-1644 for more information. -->
<acceptor name= "artemis">tcp://172.28.104.100:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;serverKeepAlive=-1;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLrnalManagementObjects=false</acceptor>
<!-- AMQP Acceptor. Listens on default AMQP port for AMQP traffic.-->
<acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
<!-- STOMP Acceptor. -->
<acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>
<!-- HornetQ Compatibility Acceptor. Enables HornetQ Core and STOMP for legacy HornetQ clients. -->
<acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>
<!-- MQTT Acceptor -->
<acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>
</acceptors>
<security-settings>
<security-setting match="#">
<permission type="createNonDurableQueue" roles="amq"/>
<permission type="deleteNonDurableQueue" roles="amq"/>
<permission type="createDurableQueue" roles="amq"/>
<permission type="deleteDurableQueue" roles="amq"/>
<permission type="createAddress" roles="amq"/>
<permission type="deleteAddress" roles="amq"/>
<permission type="consume" roles="amq"/>
<permission type="browse" roles="amq"/>
<permission type="send" roles="amq"/>
<!-- we need this otherwise ./artemis data imp wouldn't work -->
<permission type="manage" roles="amq"/>
</security-setting>
</security-settings>
<address-settings>
<!-- if you define auto-create on certain queues, management has to be auto-create -->
<address-setting match="activemq.management#">
<dead-letter-address>DLQ</dead-letter-address>
<expiry-address>ExpiryQueue</expiry-address>
<redelivery-delay>0</redelivery-delay>
<!-- with -1 only the global-max-size is in use for limiting -->
<max-size-bytes>-1</max-size-bytes>
<message-counter-history-day-limit>10</message-counter-history-day-limit>
<address-full-policy>PAGE</address-full-policy>
<auto-create-queues>true</auto-create-queues>
<auto-create-addresses>true</auto-create-addresses>
</address-setting>
<!--default for catch all-->
<address-setting match="#">
<dead-letter-address>DLQ</dead-letter-address>
<expiry-address>ExpiryQueue</expiry-address>
<redelivery-delay>0</redelivery-delay>
<message-counter-history-day-limit>10</message-counter-history-day-limit>
<address-full-policy>PAGE</address-full-policy>
<auto-create-queues>true</auto-create-queues>
<auto-create-addresses>true</auto-create-addresses>
<auto-delete-queues>false</auto-delete-queues>
<auto-delete-addresses>false</auto-delete-addresses>
<!-- The size of each page file -->
<page-size-bytes>10M</page-size-bytes>
<!-- When we start applying the address-full-policy, e.g paging -->
<!-- Both are disabled by default, which means we will use the global-max-size/global-max-messages -->
<max-size-bytes>-1</max-size-bytes>
<max-size-messages>-1</max-size-messages>
<!-- When we read from paging into queues (memory) -->
<max-read-page-messages>-1</max-read-page-messages>
<max-read-page-bytes>20M</max-read-page-bytes>
<!-- Limit on paging capacity before starting to throw errors -->
<page-limit-bytes>-1</page-limit-bytes>
<page-limit-messages>-1</page-limit-messages>
</address-setting>
</address-settings>
<addresses>
<address name="DLQ">
<anycast>
<queue name="DLQ" />
</anycast>
</address>
<address name="ExpiryQueue">
<anycast>
<queue name="ExpiryQueue" />
</anycast>
</address>
</addresses>
<connectors>
<connector name="artemis">tcp://172.28.104.100:61616</connector>
</connectors>
<broadcast-groups>
<broadcast-group name="gss-broadcast">
<group-address>${udp-address:231.7.7.7}</group-address>
<group-port>9876</group-port>
<broadcast-period>100</broadcast-period>
<connector-ref>artemis</connector-ref>
</broadcast-group>
</broadcast-groups>
<discovery-groups>
<discovery-group name="gss-discovery">
<group-address>${udp-address:231.7.7.7}</group-address>
<group-port>9876</group-port>
<refresh-timeout>10000</refresh-timeout>
</discovery-group>
</discovery-groups>
<ha-policy>
<replication>
<master>
<check-for-live-server>true</check-for-live-server>
</master>
</replication>
</ha-policy>
<cluster-user>cluster-user</cluster-user>
<cluster-password>cluster</cluster-password>
<cluster-connections>
<cluster-connection name="gss-cluster">
<connector-ref>artemis</connector-ref>
<use-duplicate-detection>true</use-duplicate-detection>
<message-load-balancing>ON_DEMAND</message-load-balancing>
<max-hops>1</max-hops>
<!--address>jms</address>
<retry-interval>500</retry-interval>-->
<!--<static-connectors>
<connector-ref>artemis</connector-ref>
</static-connectors> -->
<discovery-group-ref discovery-group-name="gss-discovery"/>
</cluster-connection>
</cluster-connections>
<metrics>
<jvm-memory>true</jvm-memory> <!-- defaults to true -->
<jvm-gc>true</jvm-gc> <!-- defaults to false -->
<jvm-threads>true</jvm-threads> <!-- defaults to false -->
<file-descriptors>true</file-descriptors> <!-- defaults to false -->
<processor>true</processor> <!-- defaults to false -->
<uptime>true</uptime> <!-- defaults to false -->
<plugin class-name="com.redhat.amq.broker.core.server.metrics.plugins.ArtemisPrometheusMetricsPlugin"/>
</metrics>
</core>
</configuration>
I have tried increasing the JVM thread pool size, max memory, and min memory.
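For reference, a sketch of where those settings live (the values below are illustrative, not the exact ones I used): the JVM memory flags are set via JAVA_ARGS in etc/artemis.profile of each broker instance, and the thread pool size corresponds to the <thread-pool-max-size> element that is commented out in the config above.

# etc/artemis.profile (one per broker instance): memory flags live on the
# stock JAVA_ARGS line; -Xms/-Xmx values here are illustrative for a 20 GB VM
JAVA_ARGS="-XX:+UseG1GC -Xms8G -Xmx12G"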
In order to support persistent sessions (which are part of the MQTT specification) some state is written to disk on every connection. If your storage is slow this can become a bottleneck.
You can disable this functionality by adding this to the <addresses> block in broker.xml:
<address name="$sys.mqtt.sessions">
<anycast>
<queue name="$sys.mqtt.sessions">
<durable>false</durable>
</queue>
</anycast>
</address>
This will increase performance, but it means existing MQTT subscribers will have to re-subscribe when they reconnect after a broker restart rather than having their subscriptions automatically restored. This may be perfectly fine for your use case (e.g. if clients use clean sessions), or it may cause problems if clients expect their subscriptions to be restored automatically on reconnect.
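For illustration, here's what a clean session looks like from the client side; a minimal sketch assuming the Eclipse Paho MQTT v3 Java client (the client library, broker URL, client ID, and topic are my assumptions, not taken from your setup):

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;

public class CleanSessionExample {
    public static void main(String[] args) throws Exception {
        MqttConnectOptions opts = new MqttConnectOptions();
        // cleanSession=true tells the broker to discard session state (including
        // subscriptions) on disconnect, so there is nothing to restore after a restart
        opts.setCleanSession(true);

        MqttClient client = new MqttClient("tcp://172.28.104.100:1883", "example-client-1");
        client.connect(opts);
        client.subscribe("example/topic");
        client.disconnect();
    }
}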
You can also disable persistence completely using this:
<persistence-enabled>false</persistence-enabled>
However, this is generally not recommended as it means no messages will survive a broker restart.
It's worth noting that your disk seems slow based on the calculated journal-buffer-timeout. When the broker is created it performs a load calculation on the disk to determine the value for journal-buffer-timeout in broker.xml. As noted in the related comment:
Your system could perform 19.23 writes per millisecond.
The SSD on my laptop can support 250 writes per millisecond - an order of magnitude more. It's possible that there was some activity on your disk when the calculation was made, so you can recalculate it using the bin/artemis perf-journal command and modify broker.xml with the new value. However, if the new result isn't significantly better than the old one then it won't make much difference.
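For reference, the recalculation itself is a single command run from the broker instance directory (a minimal invocation; the command supports further options I'm not showing):

./bin/artemis perf-journal

Once it finishes, copy the suggested value into the <journal-buffer-timeout> element in broker.xml.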
This is changing significantly in 2.42.0 due to ARTEMIS-5499. The possibility of a timeout will be eliminated completely, and, if necessary, you will be able to disable subscription persistence by setting mqtt-subscription-persistence-enabled to false in broker.xml.
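Based on that description, the new element would presumably look something like this (name taken from ARTEMIS-5499; verify against the 2.42.0 documentation once it's released):

<!-- inside <core> in broker.xml, Artemis 2.42.0+ -->
<mqtt-subscription-persistence-enabled>false</mqtt-subscription-persistence-enabled>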