If the ZooKeeper leader process is killed, should all followers get exceptions and restart too?

I'm working on a project using Zookeeper 3.4.6, and am performing some failure mode testing. While doing so, I found (what I think is) unexpected behaviour.

Should followers restart if the leader Zookeeper process is killed?

Environment:

OS:        Windows Server 2008 R2 (hosted in a Tanuki Java service wrapper)
Zookeeper: 3.4.6
Java JDK:  1.7.0.210

Tests:

The test is to kill Zookeeper processes and make sure the cluster recovers.

If I kill a non-leader process, it restarts and rejoins the cluster without affecting other nodes.

If I kill the leader process, the leader and followers restart. This doesn't seem right, as there's a period of time where clients can't connect to any Zookeeper node.

I've tried both the TCP and UDP election settings (electionAlg, shown below), and both exhibit the same behaviour, although UDP recovers twice as quickly.

Zookeeper settings

tickTime=2000
initLimit=5
syncLimit=2
minSessionTimeout=5000
maxSessionTimeout=120000
dataDir=C:\\ProgramData\\Saab OneView\\ZooKeeper\\zoo-data
clientPort=2181
leaderServes=yes
autopurge.purgeInterval=24

# IP addresses blanked out here
server.1=0.0.0.1:2888:3888
server.2=0.0.0.2:2888:3888
server.3=0.0.0.3:2888:3888
server.4=0.0.0.4:2888:3888
server.5=0.0.0.5:2888:3888

# This is for zookeeper->zookeeper communication
# I've tried both settings, UDP has faster recovery time
# 0 = UDP 
# 3 = TCP (default)
electionAlg=3
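
For reference, the tick-based limits in this configuration can be converted to wall-clock values with a little arithmetic. The sketch below is not ZooKeeper code: the function name and dictionary keys are invented for illustration, and it assumes the default majority quorum over the five listed servers.

```python
# Values copied from the zoo.cfg shown in the question.
TICK_TIME_MS = 2000
INIT_LIMIT_TICKS = 5
SYNC_LIMIT_TICKS = 2
ENSEMBLE_SIZE = 5  # server.1 .. server.5


def derived_limits(tick_ms, init_ticks, sync_ticks, servers):
    """Convert tick-based settings to milliseconds and compute the majority quorum.

    Hypothetical helper for illustration only; not part of any ZooKeeper API.
    """
    quorum = servers // 2 + 1  # assumes the default majority quorum
    return {
        # time a follower is allowed to connect and sync to a leader
        "init_limit_ms": tick_ms * init_ticks,
        # how far a follower may lag behind before it is dropped
        "sync_limit_ms": tick_ms * sync_ticks,
        "quorum": quorum,
        "tolerated_failures": servers - quorum,
    }


limits = derived_limits(TICK_TIME_MS, INIT_LIMIT_TICKS, SYNC_LIMIT_TICKS, ENSEMBLE_SIZE)
print(limits)
```

With these settings a follower gets 10 seconds to connect and sync to a leader, may lag by at most 4 seconds afterwards, and the ensemble needs 3 of its 5 servers to form a quorum, so losing one server still leaves enough peers to elect a new leader.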

Sample follower exception causing shutdown

20160309 05:35:51.958Z 20160309 05:35:51.958 [myid:3] - WARN  [RecvWorker:4:QuorumCnxManager$RecvWorker@780] - Connection broken for id 4, my id = 3, error = 
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
20160309 05:35:51.959Z 20160309 05:35:51.959 [myid:3] - WARN  [RecvWorker:4:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
20160309 05:35:51.959Z 20160309 05:35:51.959 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
20160309 05:35:51.960Z 20160309 05:35:51.960 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
Seals answered 10/3, 2016 at 1:3

According to ZOOKEEPER-3478, this is expected behaviour:

It is normal behaviour that all the followers shutdown during a leader election. Since there is no leader after a leader crash, the servers that used to be followers are not followers anymore. So the followers shutdown and go back to LOOKING state in order to find the new leader.
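
The transition described above can be sketched as a toy state model. This is only an illustration under stated assumptions, not ZooKeeper's election algorithm: the function names are hypothetical, the killed leader's id (4) is chosen arbitrarily, and "highest surviving id wins" is a stand-in for the real vote comparison.

```python
from enum import Enum


class State(Enum):
    LOOKING = "LOOKING"
    FOLLOWING = "FOLLOWING"
    LEADING = "LEADING"


def kill_leader(states, leader_id):
    """Toy model: the leader disappears and every surviving peer drops to LOOKING."""
    return {sid: State.LOOKING for sid in states if sid != leader_id}


def elect(states, ensemble_size):
    """Toy election: requires a majority of the configured ensemble.

    Picking the highest id is a stand-in rule (assumption), not the real algorithm.
    """
    quorum = ensemble_size // 2 + 1
    if len(states) < quorum:
        return dict(states)  # no quorum: everyone stays in LOOKING
    new_leader = max(states)
    return {
        sid: State.LEADING if sid == new_leader else State.FOLLOWING
        for sid in states
    }


# Five-server ensemble with server 4 as the (arbitrarily chosen) leader.
before = {sid: State.LEADING if sid == 4 else State.FOLLOWING for sid in range(1, 6)}
after_kill = kill_leader(before, leader_id=4)
after_election = elect(after_kill, ensemble_size=5)

for sid in sorted(after_election):
    print(sid, after_kill[sid].value, "->", after_election[sid].value)
```

In this model, the interval during which every peer is in LOOKING corresponds to the window in which clients cannot be served; it ends once a quorum of the surviving peers has agreed on a new leader.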

Lilli answered 4/1, 2022 at 10:46