I'm using Apache Curator framework to create a cluster, managed by ZK.
When suddenly some node in the cluster loses its connection to ZK, it tries to reconnect with it every 5 seconds. I'm using RetryForever
policy with the specified timing for this.
While sessionTimeout
/connectionTimeouts
are not fully elapsed we're still trying to reconnect.
But even if we raised ZK and acquired connection to it within this time period, we're still getting strange messages in logs:
Thu Nov 30 20:47:51.574 GMT 2017| INFO | org.apache.zookeeper.ClientCnxn$SendThread | Socket connection established to zk_1.default/138.122.177.23:2181, initiating session |Client Details{sessionTag:{}}| localhost-startStop-1-SendThread(zk_1.default:2181)
Thu Nov 30 20:47:51.592 GMT 2017| INFO |
org.apache.zookeeper.ClientCnxn$SendThread | Unable to read additional
data from server sessionid 0x1600ea13dcd0000, likely server has closed
socket, closing socket connection and attempting reconnect |Client
Details{sessionTag:{}}|
localhost-startStop-1-SendThread(zk_1.default:2181)
Why do we still receive these messages and can't fully connect to a freshly raised ZK node?
A little bit later I have found that this error means that ZK has exhausted its maxClientCnxns
(maximum connections to ZK) parameter, but I haven't found how to configure it using Curator... Only in tests...
On ZK server side I see following errors:
2017-12-04 15:48:29,972 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.107.4:37130
2017-12-04 15:48:29,974 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@915] - Refusing session request for client /138.122.177.23:37130 as it has seen zxid 0xd our last zxid is 0x0 client must try another server
The answer is right here: http://grokbase.com/t/cloudera/cdh-user/134hrpr3nc/zookeeperserver-refusing-session-request-for-client-any-ideas
This means the client was talking to a ZK server and last saw transaction ID 0x11be4 (or 72676 in decimal). The current server to which it's trying to connect currently has transaction ID 0x3f82 (or 16258 in decimal) which is (much) lower. The assumption is that the client has seen later data than what the server could provide, potentially resulting in conflicting information. Instead of giving the client bad (outdated) info, it refuses the connection with the assumption that the client will reconnect to another server in the quorum with more up to date data, or the currently the outdated server will eventually receive a snapshot from another quorum member to catch it up.