I'm working on a project using Zookeeper 3.4.6 and am doing some failure-mode testing. While doing so, I found what I think is unexpected behaviour.
Should followers restart if the leader Zookeeper process is killed?
Environment:
OS: Windows Server 2008 R2 (Zookeeper is hosted in a Tanuki Java Service Wrapper)
Zookeeper: 3.4.6
Java JDK: 1.7.0.210
Tests:
The test is to kill Zookeeper processes and make sure the cluster recovers.
If I kill a non-leader process, it restarts and rejoins the cluster without affecting other nodes.
If I kill the leader process, the followers restart as well as the leader. This doesn't seem right, because for a period of time clients can't connect to any Zookeeper node.
I've tried both the TCP and UDP settings for communication between the servers (electionAlg in the settings below), and both exhibit the same behaviour, although recovery is about twice as fast with UDP.
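One way to put numbers on the outage window during these tests is to poll every server with ZooKeeper's four-letter `srvr` command and record the role each node reports. This is a minimal sketch, assuming client port 2181 is reachable from the test machine. The helper names (`four_letter_word`, `parse_mode`, `poll_ensemble`) are hypothetical and not part of any ZooKeeper client library.

```python
import socket


def four_letter_word(host, port=2181, command=b"srvr", timeout=2.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. leader, follower), or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def poll_ensemble(hosts, port=2181):
    """Map each host to its reported mode, or 'unreachable' on a socket error."""
    modes = {}
    for host in hosts:
        try:
            reply = four_letter_word(host, port)
            # A node without quorum answers, but its reply has no Mode line.
            modes[host] = parse_mode(reply) or "not serving"
        except OSError:
            modes[host] = "unreachable"
    return modes
```

Calling `poll_ensemble` with the five server addresses in a loop (say, every half second) while killing the leader should show how long each node stays in "unreachable" or "not serving" before a new leader appears.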
Zookeeper settings:
tickTime=2000
initLimit=5
syncLimit=2
minSessionTimeout=5000
maxSessionTimeout=120000
dataDir=C:\\ProgramData\\Saab OneView\\ZooKeeper\\zoo-data
clientPort=2181
leaderServes=yes
autopurge.purgeInterval=24
# IP addresses blanked out here
server.1=0.0.0.1:2888:3888
server.2=0.0.0.2:2888:3888
server.3=0.0.0.3:2888:3888
server.4=0.0.0.4:2888:3888
server.5=0.0.0.5:2888:3888
# This is for zookeeper->zookeeper communication
# I've tried both settings, UDP has faster recovery time
# 0 = UDP
# 3 = TCP (default)
electionAlg=3
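For reference, initLimit and syncLimit are expressed in ticks, so the settings above translate into wall-clock limits as sketched here. The variable names are only illustrative, and the values are copied from the configuration above.

```python
# Values copied from the settings above.
tick_time_ms = 2000
init_limit_ticks = 5
sync_limit_ticks = 2
ensemble_size = 5  # server.1 through server.5

# initLimit: time a follower is allowed to connect to and sync with a leader.
init_timeout_ms = init_limit_ticks * tick_time_ms
# syncLimit: how far a follower may fall behind the leader before being dropped.
sync_timeout_ms = sync_limit_ticks * tick_time_ms
# A majority of the ensemble must be up for the cluster to serve requests.
quorum = ensemble_size // 2 + 1

print(f"initLimit -> {init_timeout_ms} ms")  # 10000 ms
print(f"syncLimit -> {sync_timeout_ms} ms")  # 4000 ms
print(f"quorum    -> {quorum} of {ensemble_size} servers")  # 3 of 5
```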
Sample follower exception causing shutdown:
20160309 05:35:51.958Z 20160309 05:35:51.958 [myid:3] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@780] - Connection broken for id 4, my id = 3, error =
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
20160309 05:35:51.959Z 20160309 05:35:51.959 [myid:3] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
20160309 05:35:51.959Z 20160309 05:35:51.959 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
20160309 05:35:51.960Z 20160309 05:35:51.960 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
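To line up events like these across nodes, the leading timestamp on each log line can be parsed and compared. This is a rough sketch, assuming every log line starts with the `YYYYMMDD HH:MM:SS.mmmZ` stamp seen above. The function name `first_event_time` is hypothetical, and the two log lines are copied from the sample.

```python
import re
from datetime import datetime

# Leading timestamp on each line, e.g. "20160309 05:35:51.958Z"
STAMP = re.compile(r"^(\d{8} \d{2}:\d{2}:\d{2}\.\d{3})Z ")


def first_event_time(lines, needle):
    """Return the timestamp of the first log line containing `needle`, or None."""
    for line in lines:
        match = STAMP.match(line)
        if match and needle in line:
            return datetime.strptime(match.group(1), "%Y%m%d %H:%M:%S.%f")
    return None


log = [
    "20160309 05:35:51.958Z 20160309 05:35:51.958 [myid:3] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@780] - Connection broken for id 4, my id = 3, error =",
    "20160309 05:35:51.960Z 20160309 05:35:51.960 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called",
]

broken = first_event_time(log, "Connection broken")
shutdown = first_event_time(log, "shutdown called")
print(shutdown - broken)  # 0:00:00.002000
```

In this sample the follower calls shutdown 2 ms after the connection to server 4 breaks. Running the same comparison over the full logs from each node would show how long each one takes to start serving again.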