Why can't my ZooKeeper server rejoin the quorum?

Asked 3/3, 2014 at 19:26 Answered 15/6, 2022 at 19:38

I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:

2014-03-03 18:44:40,995 [myid:1] - INFO  [main:QuorumPeer@429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation

and:

2014-03-03 18:44:41,233 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@774] - Notification time out: 400

Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading ZooKeeper.

Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.

It looks like the second issue is definitely in play, because I see timeouts in the other ZooKeeper servers' logs (the ones still in the quorum) when they try to connect to the first server. What I'm not sure of is whether the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: is there a way to resolve both issues without doing a rolling restart?

Thanks for reading and for your help!

Murray answered 3/3, 2014 at 19:26 Comment(2)

We ended up doing a rolling restart and everything seems to be working fine now. It seems the first issue was only perceived and that the DNS caching was the real culprit. – Murray 3/3, 2014 at 23:16
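One way to confirm each node's role after a restart like this is the mntr command mentioned in the question. The following is a minimal Python sketch, assuming the client port is 2181 and that four-letter commands are enabled on the server (newer releases may require whitelisting them). The helper names are made up for illustration and are not part of any ZooKeeper client library.

```python
import socket


def send_four_letter_word(host, port=2181, command="mntr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn mntr's tab-separated 'key<TAB>value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that are not key/value pairs
            stats[key.strip()] = value.strip()
    return stats


def server_state(host, port=2181):
    """Return the zk_server_state value, or None if it cannot be determined."""
    try:
        reply = send_four_letter_word(host, port)
    except OSError:
        return None  # unreachable, refused, or timed out
    return parse_mntr(reply).get("zk_server_state")
```

Calling server_state() once per ensemble member should show one leader and the rest followers. A node that has not rejoined the quorum will likely answer with a plain message instead of key/value pairs, in which case the function returns None.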

Hi @fpearsall, You could move your solution to an "Answer" imo. – Lucianolucias 8/1, 2019 at 7:2

This is a ZooKeeper bug: "Server is unable to join quorum after connection broken to other peers". Restarting the leader solves the issue.

Antonietta answered 23/12, 2019 at 11:4 Comment(3)

This worked for me, although I needed to restart the other follower server as well, because it stopped working after restarting the leader. We are running on version 3.6.2, so this version is affected as well it seems. – Kimes 2/8, 2021 at 7:45

I am having the same problem on 3.6.2. I stopped a zookeeper node with id 2 and restarted it. It does not seem to want to reconnect to the cluster. The current leader appears to be in 3. The message is: Have smaller server identifier, so dropping the connection: (myId:2 --> sid:3) Is there not a way to restart a zookeeper node with a lower id than the leader? – Precipitancy 6/12, 2021 at 18:57
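For context on the "Have smaller server identifier" message: as far as I understand ZooKeeper's election connection handling, only one election connection is kept per pair of servers. A server that dials a peer with a larger id drops its own attempt and waits for that peer to connect back. The message is therefore expected on a node with a lower id, and the real failure is the peer never connecting back (for example because it holds a stale address). The sketch below is a simplified model of that rule, not ZooKeeper's actual code, and the function names are invented for illustration.

```python
def keeps_outgoing_connection(my_id, peer_id):
    """Simplified tie-break: the dialing server keeps the connection
    only when its own id is larger than the peer's id."""
    return my_id > peer_id


def describe_dial(my_id, peer_id):
    """Describe what the dialing server does under the simplified rule."""
    if keeps_outgoing_connection(my_id, peer_id):
        return f"server {my_id} keeps its connection to server {peer_id}"
    return (
        f"server {my_id} drops its connection to server {peer_id} "
        f"and waits for server {peer_id} to connect back"
    )


# Server 1 restarting in a three-node ensemble, as in the question's logs.
for peer in (2, 3):
    print(describe_dial(1, peer))
```

Under this model a restarted node with a lower id can rejoin, but only if the higher-id servers are able to reach it at the address they currently have for it.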

I think it is really annoying that this bug still exists in 2023. However, one guy figured a way to make it work by changing the hostname of the host where the zk instance runs to 0.0.0.0: issues.apache.org/jira/browse/… – Kimes 19/8, 2023 at 7:21

This problem occurs when pods or hosts rejoin the cluster with a different IP address while using the same id. A host's IP can change from the cluster's point of view if your config specifies 0.0.0.0 or a domain name for it. So follow these instructions:

1. Stop all servers, and in the config use

server.1=10.x.x.x:1234:5678
server.2=10.x.x.y:1234:5678
server.3=10.x.x.z:1234:5678

Use each server's LAN IP address as its identifier, not a DNS name.

2. Start your servers. It should work.
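Before restarting, the config can be checked for entries that still use a hostname or 0.0.0.0. This is a rough Python sketch, assuming IPv4 addresses and the plain server.N=host:port:port layout shown above. The function names, addresses, and hostname in the example are hypothetical.

```python
import ipaddress
import re

# Captures the server id and the host part of "server.N=host:port:port".
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+):")


def is_ip_literal(host):
    """True if host is a literal IP address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def hosts_needing_attention(cfg_text):
    """Return {server_id: host} for entries that use a hostname or 0.0.0.0."""
    flagged = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if not match:
            continue  # not a server.N line
        server_id, host = int(match.group(1)), match.group(2)
        if not is_ip_literal(host) or host == "0.0.0.0":
            flagged[server_id] = host
    return flagged


# Hypothetical config text, for illustration only.
example_cfg = """\
tickTime=2000
server.1=10.0.0.11:1234:5678
server.2=zk2.example.internal:1234:5678
server.3=0.0.0.0:1234:5678
"""
print(hosts_needing_attention(example_cfg))
# prints {2: 'zk2.example.internal', 3: '0.0.0.0'}
```

An empty result means every server.N entry already uses a fixed IP address.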

Feudality answered 15/6, 2022 at 19:38 Comment(0)
