java low-latency jgroups

How can we achieve low-latency message transmission with JGroups?


We have implemented message broadcasting between instances of the same API using JGroups to synchronize application state. We use JGroups 5.2.4.Final on JDK 11.0.17 and Linux 5.4.0. Currently, some messages take around 7 seconds (in some cases up to 50 seconds) to be transmitted from one API instance to another, and we would like to achieve latencies below 1 second.

This issue occurs even when both API instances run on the same machine.

To send messages, we use:

channel = new org.jgroups.JChannel(networkConfig.getInputStream()); // channel created once at startup of api
channel.setName("XNetwork");
channel.connect("XCluster");
channel.setReceiver(receiver);
channel.addChannelListener(this);
channel.setDiscardOwnMessages(true);
...
Message message = new org.jgroups.BytesMessage(null, command);
channel.send(message); // executed when needed to synchronize state

with the following configuration (networkConfig):

 <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="urn:org:jgroups"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
      <TCP
          external_addr="match-interface:eth0"
          bind_addr="site_local,match-interface:eth0"
          bind_port="${jgroups.tcp_bind_port:7800}"
          recv_buf_size="5M"
          send_buf_size="1M"
          thread_naming_pattern="cl"
          thread_pool.min_threads="0"
          thread_pool.max_threads="500"
          thread_pool.keep_alive_time="30000" />
      <TCPPING async_discovery="true"
             initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"
             return_entire_cache="${jgroups.tcpping.return_entire_cache:false}"
             port_range="${jgroups.tcp.port_range:2}"/>
      <MERGE3 max_interval="30000"
              min_interval="10000"/>
      <FD_SOCK2/>
      <FD_ALL3 timeout="30000" interval="5000"/>
      <VERIFY_SUSPECT2 timeout="1500"  />
      <BARRIER />
      <pbcast.NAKACK2 xmit_interval="500"
                      xmit_table_num_rows="100"
                      xmit_table_msgs_per_row="2000"
                      xmit_table_max_compaction_time="30000"
                      use_mcast_xmit="false"
                      discard_delivered_msgs="true" />
      <UNICAST3
              xmit_table_num_rows="100"
              xmit_table_msgs_per_row="1000"
              xmit_table_max_compaction_time="30000"/>
      <pbcast.STABLE desired_avg_gossip="50000" max_bytes="8m"/>
      <pbcast.GMS print_local_addr="true" join_timeout="3000" />
      <UFC max_credits="2M" min_threshold="0.4"/>
      <MFC max_credits="2M" min_threshold="0.4"/>
      <FRAG2 frag_size="60K"  />
      <pbcast.STATE_TRANSFER  />
</config>

Are there any JGroups parameters we could tune to reduce the latency?

How can we troubleshoot this latency issue to determine the root cause? For example, how can we find out where in the network stack the latency originates?

We tried changing max_bundle_size, without success. Setting the DONT_BUNDLE flag on the message reduced the time slightly, but not enough. We also experimented with the <FRAG2> and <FRAG4> protocols, again without success. We expect each message to be delivered in less than 1 second.


Solution

  • There is an open issue in the JGroups tracker to establish a JGroups configuration optimized for low latency: https://issues.redhat.com/browse/JGRP-2601. The issue notes that JGroups is biased towards high throughput by default and links to a working document: https://github.com/belaban/JGroups/blob/master/doc/design/LatencyVersusThroughput.txt

    The recommendations listed in this document are:

    Recommendations
    ---------------
    
    * General
      * Use a JDK >= 15. Changes to the networking code in 15 improved performance for TCP (and UDP datagram sockets)
        a lot: http://belaban.blogspot.com/2020/07/double-your-performance-virtual-threads.html
      * Use virtual threads (set use_virtual_threads=true in the transport). This reduces context switching when more
        threads than cores are used.
      * Remove TIME when up/down measurements are not needed anymore; every message sent/received requires a
        System.nanoTime(), slowing things down a bit
    
    * TCP:
      * Low latency:
        * buffered_output_stream_size=0 (or a low value)
        * bundler_type="no-bundler"
          * Investigate: use "per-destination" bundler_type, but also use OOB|DONT_BUNDLE
    
      * High throughput: set buffered_output_stream_size to a high value (e.g. 65k)
    
    * UNICAST3
      * ack_threshold: <investigate>
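
The first general recommendation (JDK >= 15) could be enforced with a small startup guard. The following is only a sketch: the class `JdkVersionGuard` and its method names are made up for illustration and are not part of JGroups. It relies solely on the standard `Runtime.version().feature()` call, which is available from Java 10 onwards.

```java
// Hypothetical startup guard (not part of JGroups): warns when the running JVM
// is older than the JDK 15 baseline named in the recommendations above.
final class JdkVersionGuard {
    // Minimum feature release taken from the recommendation "Use a JDK >= 15".
    static final int MIN_RECOMMENDED_FEATURE = 15;

    // Returns true if the given feature release (e.g. 11, 17) meets the baseline.
    static boolean meetsRecommendation(int featureVersion) {
        return featureVersion >= MIN_RECOMMENDED_FEATURE;
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        if (meetsRecommendation(feature)) {
            System.out.println("JDK " + feature + " meets the recommended minimum of "
                    + MIN_RECOMMENDED_FEATURE + ".");
        } else {
            System.err.println("Warning: running on JDK " + feature + "; JDK "
                    + MIN_RECOMMENDED_FEATURE + " or later is recommended for lower TCP latency.");
        }
    }
}
```

Calling `meetsRecommendation` once before creating the channel would be enough; whether to warn or to fail fast is a design choice left to the application.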
    

    After upgrading to JDK 17, our latency issues were indeed resolved.
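
To check end-to-end delivery time before and after such a change, one option is to prefix each payload with its send time and compute the difference on receipt. Below is a minimal sketch, assuming sender and receiver share a clock (which holds for the same-machine case described in the question). `LatencyProbe` and its methods are hypothetical helpers, not JGroups API; the stamped array would take the place of `command` in the question's `new org.jgroups.BytesMessage(null, command)` call.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper (not part of JGroups): prefixes a payload with its send
// time so the receiver can compute the end-to-end delivery latency.
final class LatencyProbe {
    // Prepends an 8-byte big-endian timestamp to the command bytes.
    static byte[] stamp(byte[] command, long sentAtMillis) {
        return ByteBuffer.allocate(Long.BYTES + command.length)
                .putLong(sentAtMillis)
                .put(command)
                .array();
    }

    // Reads the send timestamp back from a stamped payload.
    static long sentAt(byte[] stamped) {
        return ByteBuffer.wrap(stamped).getLong();
    }

    // Recovers the original command bytes from a stamped payload.
    static byte[] command(byte[] stamped) {
        return Arrays.copyOfRange(stamped, Long.BYTES, stamped.length);
    }

    // Latency in milliseconds; only meaningful when both sides share a clock.
    static long latencyMillis(byte[] stamped, long receivedAtMillis) {
        return receivedAtMillis - sentAt(stamped);
    }

    public static void main(String[] args) {
        // Sender side: stamp the payload just before handing it to the channel.
        byte[] stamped = stamp("sync-state".getBytes(), System.currentTimeMillis());
        // Receiver side: compute the latency as soon as the payload arrives.
        long latency = latencyMillis(stamped, System.currentTimeMillis());
        System.out.println("Delivered " + command(stamped).length + " bytes in " + latency + " ms");
    }
}
```

Logging this value per message on the receiver makes it possible to see whether the 7 to 50 second outliers disappear after a configuration or JDK change. Across different hosts the numbers would also include clock skew, so a round-trip measurement would be more reliable there.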