apacheluceneapache-zookeepersolrcloud

zookeeper node fails to communicate with Leader node


I have a zookeeper cluster which includes 3 nodes. Zookeeper config is mentioned below. While restarting it shows a success message but it shows status as failure.

zoo.cfg

dataDir=/ngs/app/<app>/zookeeper-3.4.6/zookeeperdata/1
clientPort=2181
initLimit=5
syncLimit=2
server.1=pr2-ligerp-lapp27.<domain.com>:2888:3888
server.2=pr2-ligerp-lapp28.<domain.com>:2889:3889
server.3=pr2-ligerp-lapp29.<domain.com>:2890:3890

Please find the logs below:

sh zkServer.sh start

JMX enabled by default
Using config: /ngs/app/ligerp/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
-bash-4.1$ 
-bash-4.1$ cat zookeeper.out 
2017-04-18 18:58:13,840 [myid:] - INFO  [main:QuorumPeerConfig@103] - Reading configuration from: /ngs/app/ligerp/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
2017-04-18 18:58:13,843 [myid:] - INFO  [main:QuorumPeerConfig@340] - Defaulting to majority quorums
2017-04-18 18:58:13,845 [myid:1] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2017-04-18 18:58:13,845 [myid:1] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2017-04-18 18:58:13,846 [myid:1] - INFO  [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2017-04-18 18:58:13,854 [myid:1] - INFO  [main:QuorumPeerMain@127] - Starting quorum peer
2017-04-18 18:58:13,861 [myid:1] - INFO  [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2017-04-18 18:58:13,875 [myid:1] - INFO  [main:QuorumPeer@959] - tickTime set to 3000
2017-04-18 18:58:13,875 [myid:1] - INFO  [main:QuorumPeer@979] - minSessionTimeout set to -1
2017-04-18 18:58:13,875 [myid:1] - INFO  [main:QuorumPeer@990] - maxSessionTimeout set to -1
2017-04-18 18:58:13,875 [myid:1] - INFO  [main:QuorumPeer@1005] - initLimit set to 5
2017-04-18 18:58:13,884 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot /ngs/app/ligerp/solr/zookeeper-3.4.6/zookeeperdata/1/version-2/snapshot.1300000032
2017-04-18 18:58:13,954 [myid:1] - INFO  [Thread-1:QuorumCnxManager$Listener@504] - My election bind port: pr2-ligerp-lapp27.<domain>/10.136.145.38:3888
2017-04-18 18:58:13,960 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@714] - LOOKING
2017-04-18 18:58:13,961 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@815] - New election. My id =  1, proposed zxid=0x130000024b
2017-04-18 18:58:13,962 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x130000024b (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x13 (n.peerEpoch) LOOKING (my state)
2017-04-18 18:58:13,964 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
2017-04-18 18:58:13,964 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
2017-04-18 18:58:14,165 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
2017-04-18 18:58:14,166 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
2017-04-18 18:58:14,166 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
2017-04-18 18:58:15,566 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
2017-04-18 18:58:15,567 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
2017-04-18 18:58:15,567 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
2017-04-18 18:58:16,368 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
2017-04-18 18:58:16,368 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
2017-04-18 18:58:16,368 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
2017-04-18 18:58:17,969 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
2017-04-18 18:58:17,969 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
2017-04-18 18:58:17,970 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200

But after checking status, we found that it's not running but we can still see the process id. Could someone help fixing this issue.

sh zkServer.sh status

JMX enabled by default
Using config: /ngs/app/ligerp/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

This is the exception in the leader node of zookeeper:

2017-04-18 18:25:32,634 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:47916
2017-04-18 18:25:32,635 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@827] - Processing srvr command from /127.0.0.1:47916
2017-04-18 18:25:32,635 [myid:3] - INFO  [Thread-22:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:47916 (no session established for client)
2017-04-18 18:30:01,662 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@780] - Connection broken for id 1, my id = 3, error = 
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
2017-04-18 18:30:01,663 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
2017-04-18 18:30:01,662 [myid:3] - ERROR [LearnerHandler-/10.136.145.38:47656:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
2017-04-18 18:30:01,663 [myid:3] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
2017-04-18 18:30:01,663 [myid:3] - WARN  [LearnerHandler-/10.136.145.38:47656:LearnerHandler@646] - ******* GOODBYE /10.136.145.38:47656 ********
2017-04-18 18:30:01,663 [myid:3] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
2017-04-18 18:39:40,076 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxnFactory@197] - Accepted socket connection from /10.136.145.38:58748
2017-04-18 18:39:40,077 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2017-04-18 18:39:40,078 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@1007] - Closed socket connection for client /10.136.145.38:58748 (no session established for client)
2017-04-18 18:42:46,516 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:47988
2017-04-18 18:42:46,516 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@827] - Processing srvr command from /127.0.0.1:47988
2017-04-18 18:42:46,517 [myid:3] - INFO  [Thread-23:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:47988 (no session established for client)

Solution

  • Fixed the issue by making below changes in zoo.cfg of all the zookeeper nodes:

    1. changes the zookeeper hostname to IP address
    2. started zookeeper instances in a sequence of descending id.
    3. increased initSize from 5 to 100

    A sample zoo.cfg looks like this:

    dataDir=/ngs/app/ligerp/solr/zookeeper-3.4.6/zookeeperdata/1
    clientPort=2181
    initLimit=100
    syncLimit=2
    server.1=10.136.145.38:2888:3888
    server.2=10.136.145.39:2889:3889
    server.3=10.136.145.40:2890:3890