I am using jgroups-4.0.12.Final.jar with infinispan-core-9.3.0, and I am facing an issue where the cluster view does not recover after a network failure.
Cluster A has a single node.
Cluster B has 3 nodes: Node X (the coordinator), plus Node Y and Node Z, all up.
The initial cross-site view is {A, X} on the Cluster A node and on Cluster B's Node X. The Cluster A node has a TCP connection established with Node X in Cluster B.
Now, when Node X experiences a network glitch lasting 2 minutes, the following happens:
The TCP connection is removed. Cluster A's view becomes {A}; Cluster B Node X's view becomes {X}.
Node Y establishes a TCP connection with the Cluster A node.
Cluster A's view becomes {A, Y}; Cluster B Node Y's view becomes {A, Y}.
At this point I expect Node X to send a JOIN_REQ to Cluster A, or perhaps trigger a merge, but that does not happen unless I restart Node X. Instead, Node Y becomes the permanent coordinator.
Intra-cluster config (Cluster B):
<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd">
<TCP bind_addr="X" bind_port="7900" enable_diagnostics="false" max_bundle_size="64K" port_range="1" recv_buf_size="20M" send_buf_size="640K" sock_conn_timeout="300" thread_naming_pattern="pl" thread_pool.enabled="true" thread_pool.keep_alive_time="60000" thread_pool.max_threads="60" thread_pool.min_threads="2"/>
<TCPPING ergonomics="false" initial_hosts="X[8700],Y[8700],Z[8700]" port_range="1"/>
<MERGE3 max_interval="30000" min_interval="10000"/>
<FD_SOCK port_range="1" start_port="8702"/>
<FD max_tries="3" timeout="15000"/>
<VERIFY_SUSPECT num_msgs="2" timeout="10000"/>
<pbcast.NAKACK2 discard_delivered_msgs="false" max_rebroadcast_timeout="3000" use_mcast_xmit="false" xmit_interval="1000" xmit_table_max_compaction_time="10000" xmit_table_msgs_per_row="10000" xmit_table_num_rows="100"/>
<UNICAST3 conn_expiry_timeout="0" xmit_interval="500" xmit_table_max_compaction_time="10000" xmit_table_msgs_per_row="10000" xmit_table_num_rows="20"/>
<pbcast.STABLE desired_avg_gossip="50000" max_bytes="1M" stability_delay="1000"/>
<pbcast.GMS join_timeout="7000" print_local_addr="false" view_bundling="true"/>
<UFC max_credits="2m" min_threshold="0.40"/>
<MFC max_credits="2m" min_threshold="0.40"/>
<FRAG2 frag_size="30k"/>
<RSVP ack_on_delivery="false" resend_interval="500" timeout="60000"/>
<relay.RELAY2 async_relay_creation="true" config="/usr/local/symplified/etc/xsite_relay2_config.xml" relay_multicasts="false" site="EG"/>
</config>
Cross-cluster config:
<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd">
<TCP bind_addr="X node IP" bind_port="8700" enable_diagnostics="false" max_bundle_size="64000" port_range="1" recv_buf_size="20000000" send_buf_size="640000" sock_conn_timeout="300" thread_pool.enabled="true" thread_pool.keep_alive_time="5000" thread_pool.max_threads="8" thread_pool.min_threads="1"/>
<TCPPING ergonomics="false" initial_hosts="A[8700],X[8700],Y[8700],Z[8700]" port_range="0"/>
<FD_SOCK/>
<FD max_tries="3" timeout="15000"/>
<VERIFY_SUSPECT num_msgs="3" timeout="10000"/>
<pbcast.NAKACK2 discard_delivered_msgs="true" max_rebroadcast_timeout="3000" use_mcast_xmit="false" xmit_interval="1000" xmit_table_max_compaction_time="30000" xmit_table_msgs_per_row="10000" xmit_table_num_rows="100"/>
<UNICAST3 conn_expiry_timeout="0" xmit_table_max_compaction_time="30000" xmit_table_msgs_per_row="2000" xmit_table_num_rows="100"/>
<pbcast.STABLE desired_avg_gossip="50000" max_bytes="8m" stability_delay="1000"/>
<pbcast.GMS join_timeout="3000" print_local_addr="false"/>
<UFC max_credits="4M" min_threshold="0.1"/>
<MFC max_credits="4M" min_threshold="0.2"/>
<FRAG2 frag_size="60000"/>
<RSVP ack_on_delivery="false" resend_interval="500" timeout="60000"/>
</config>
Am I missing any config that would help Node X resync its view and send a JOIN_REQ to the Cluster A node again? Or should I let Node Y become the permanent coordinator and keep the TCP connection to the Cluster A node?
<MERGE3 .../>
is missing from the second configuration, which explains why Node X isn't reconnecting: without a merge protocol in the stack, the partitioned subgroups (e.g. {A, Y} and {X}) are never detected, so no MERGE_REQ is ever sent to reconcile the views.
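As a sketch of the fix: add MERGE3 to the second stack, directly below the discovery protocol (TCPPING), which is where it sits in the stock JGroups tcp.xml. The min/max intervals below are copied from the intra-cluster config above; tune them to taste.

```xml
<TCPPING ergonomics="false" initial_hosts="A[8700],X[8700],Y[8700],Z[8700]" port_range="0"/>
<!-- Added: periodically broadcasts view info so partitioned
     subgroups are detected and merged back into one view -->
<MERGE3 min_interval="10000" max_interval="30000"/>
<FD_SOCK/>
```

With MERGE3 in place, Node X's singleton view {X} and the {A, Y} view should be detected as diverged within roughly max_interval, and the coordinators should then run the merge protocol instead of requiring a restart of Node X.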