apache-zookeeper apache-curator

Timeout configurations in Curator


I create a Curator client as follows:

    RetryPolicy retryPolicy = new RetryNTimes(3, 1000);
    CuratorFramework client = CuratorFrameworkFactory.newClient(zkConnectString, 
            15000, // sessionTimeoutMs
            15000, // connectionTimeoutMs
            retryPolicy);

When running my client program, I simulate a network partition by bringing down the NIC that Curator uses to communicate with ZooKeeper. I have a few questions based on the behavior I am seeing:

  1. I see a ConnectionStateManager - State change: SUSPENDED message after 10 seconds. Is the amount of time until Curator enters the SUSPENDED state configurable, based on a percentage of the other timeout values, or always 10 seconds?
  2. I do not receive any notification once the configured 15-second session timeout has elapsed since the last successful heartbeat. I do see a ZooKeeper - Session: 0x14adf3f01ef0001 closed message in the log; however, this does not appear to trickle up as an event that I can capture or listen for. Am I missing something here?
  3. I eventually receive a ConnectionStateManager - State change: LOST message almost two minutes after the connection loss. Why so long?
  4. If my goal is to use an InterProcessMutex to prevent split-brain in an HA scenario, it seems that the safest approach is for the lock holder to assume that it has lost the lock as soon as the SUSPENDED message is received, since ZooKeeper may well have released the lock, unbeknownst to the holder, on the other side of the network partition. Is this a typical/sane approach?

Solution

  • Correct. Assume leadership has been lost on both SUSPENDED and LOST. This is how the Apache Curator recipes work. You may want to use one of the Curator recipes rather than implementing your own algorithm: https://curator.apache.org/curator-recipes/index.html
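
To illustrate the "assume the lock is lost on SUSPENDED and LOST" rule, here is a minimal sketch, not the recipes' actual implementation. LockGuard and LinkState are hypothetical names introduced for this example; LinkState is only a stand-in that mirrors the values of Curator's ConnectionState so that the snippet compiles without Curator on the classpath. The wiring to a real client through getConnectionStateListenable() is shown in a comment.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for org.apache.curator.framework.state.ConnectionState (assumption:
// used here only so the sketch compiles without the Curator dependency).
enum LinkState { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

// Hypothetical helper, not part of Curator: tracks whether this process may
// still behave as the lock holder.
class LockGuard {
    private final AtomicBoolean mayActAsHolder = new AtomicBoolean(false);

    // Call this after InterProcessMutex.acquire() has returned successfully.
    void lockAcquired() {
        mayActAsHolder.set(true);
    }

    // Intended to be driven by a ConnectionStateListener, for example:
    //   client.getConnectionStateListenable().addListener(
    //       (c, newState) -> guard.onStateChanged(LinkState.valueOf(newState.name())));
    void onStateChanged(LinkState newState) {
        if (newState == LinkState.SUSPENDED || newState == LinkState.LOST) {
            // The session may already have expired on the server side,
            // so stop acting as the holder immediately.
            mayActAsHolder.set(false);
        }
        // CONNECTED and RECONNECTED deliberately do not restore ownership here;
        // the caller decides whether to re-acquire the lock.
    }

    // Check this before every action that requires holding the lock.
    boolean mayActAsHolder() {
        return mayActAsHolder.get();
    }
}
```

The protected work would check guard.mayActAsHolder() before each critical step and stop as soon as it returns false. Ownership is intentionally not restored on RECONNECTED in this sketch; re-acquiring the mutex explicitly is the conservative choice.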