ZooKeeper internal behavior on split brain scenario

I am trying to understand the internal workings of Apache ZooKeeper in split brain situations. Suppose there is a cluster of 5 servers: A, B, C, D and E, where A is the leader. Now suppose the subcluster {A, B} gets separated from the subcluster {C, D, E}.

In this case the subcluster {C, D, E} can elect a new leader and can make progress. On the other hand {A, B} cannot make progress, since there is no majority of nodes to acknowledge updates.

I'm wondering:

  1. What happens to the old leader A? I expect that it loses leadership, but how does this happen? Does the active leader run some periodic check to make sure it still has a majority of followers?

  2. What happens to the clients that were connected to A and B? Will they be automatically redirected to one of the servers that can still make progress (C, D, or E)? Or are they stuck with A or B until the split-brain situation is healed and the entire cluster is reconnected?

Thanks, Gabriel

Kamenskuralski asked 27/1, 2014 at 12:16 Comment(0)

After some experimenting with a local cluster, I think I figured out the behavior.

I started a local cluster of 5 nodes, and then took down 2 of the nodes. The remaining 3 nodes still formed a majority, so the cluster was up and running. I connected a client at this point.
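
A minimal sketch of the kind of test client this step describes, assuming the five local nodes expose client ports 2181 through 2185 (only localhost:2185 is actually visible in the logs below):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Minimal test client sketch. The connect string lists every node so the
    // client library can fail over between them; ports 2181-2184 are an
    // assumption, only localhost:2185 appears in the logs below.
    public class SplitBrainTestClient {
        public static void main(String[] args) throws Exception {
            String connectString =
                "localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185";
            ZooKeeper zk = new ZooKeeper(connectString, 15000, new Watcher() {
                public void process(WatchedEvent event) {
                    // Session events (SyncConnected, Disconnected, Expired)
                    // show up here as nodes are taken down and brought back.
                    System.out.println("Session event: " + event.getState());
                }
            });
            Thread.sleep(Long.MAX_VALUE); // keep the session open during the experiment
        }
    }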

Then I took down another server, at which point the remaining 2 nodes could no longer keep the cluster up and running.

1) In the log of one of the two remaining nodes (which happened to be the leader), I could see:

[myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my id = 5, error =
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)

and later

[myid:5] - INFO [QuorumPeer[myid=5]/127.0.0.1:2185:FastLeaderElection@740] - New election. My id = 5, proposed zxid=0x300000002

So it seems that the nodes actively monitor connectivity and react to dropped connections (in this case by trying to elect another leader).
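
For completeness, one way to watch a node's role change from the outside is ZooKeeper's plain-text "four letter word" commands on the client port. A small sketch, assuming node 5 answers on 127.0.0.1:2185 as in the logs, and that srvr is not blocked by a 4lw whitelist on newer releases:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Sends the "srvr" four-letter-word command and prints the reply.
    public class FourLetterWord {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket("127.0.0.1", 2185)) {
                OutputStream out = s.getOutputStream();
                out.write("srvr".getBytes(StandardCharsets.UTF_8));
                out.flush();
                InputStream in = s.getInputStream();
                byte[] buf = new byte[4096];
                int n;
                // While the node is part of a quorum this reports "Mode: leader"
                // or "Mode: follower"; once quorum is lost it typically answers
                // "This ZooKeeper instance is not currently serving requests".
                while ((n = in.read(buf)) > 0) {
                    System.out.print(new String(buf, 0, n, StandardCharsets.UTF_8));
                }
            }
        }
    }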

2) In the logs of the connected client, I could see:

[myid:] - INFO [main-SendThread(localhost:2185):ClientCnxn$SendThread@966] - Opening socket connection to server localhost/127.0.0.1:2185. Will not attempt to authenticate using SASL (unknown error)

[myid:] - INFO [main-SendThread(localhost:2185):ClientCnxn$SendThread@849] - Socket connection established to localhost/127.0.0.1:2185, initiating session

[myid:] - INFO [main-SendThread(localhost:2185):ClientCnxn$SendThread@1085] - Unable to read additional data from server sessionid 0x343d9a80f220000, likely server has closed socket, closing socket connection and attempting reconnect

So the node closes the connection opened by the client, because with only 2 of the 5 nodes left the ensemble has no quorum and cannot serve requests.

In this case the entire cluster was down, so the client keeps trying to connect to one of the nodes, without luck. But I assume that in a real split-brain scenario, where a majority is still up and running somewhere, the client will eventually manage to connect to it (provided it has network connectivity to that side, of course).
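
From the application's point of view, the outage shows up as ConnectionLossException on every operation until a serving node is reachable again. A hedged sketch, assuming a ZooKeeper handle like the one above and a hypothetical path /some-znode:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the client-side view of the outage: reads fail with
    // ConnectionLossException while no quorum is reachable, and succeed again
    // once the client library finds a serving node. "/some-znode" is a
    // hypothetical path, not something from the experiment above.
    public class ReadWithRetry {
        static byte[] readWithRetry(ZooKeeper zk) throws Exception {
            while (true) {
                try {
                    return zk.getData("/some-znode", false, null);
                } catch (KeeperException.ConnectionLossException e) {
                    // No usable connection right now; the client keeps cycling
                    // through the servers in its connect string, so wait and retry.
                    Thread.sleep(1000);
                }
                // A SessionExpiredException, by contrast, would propagate out and
                // is fatal for this handle: the application must then create a
                // new ZooKeeper instance.
            }
        }
    }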

Kamenskuralski answered 28/1, 2014 at 17:12 Comment(1)
So in the 2-node scenario {A, B}, did a leader get elected finally? I see in many places on the internet that the lower SID (e.g. myid:5) will be preferred when the number of votes is the same across multiple candidates. – Macadamia

According to the ZAB paper (https://marcoserafini.github.io/papers/zab.pdf), which describes the atomic broadcast protocol ZooKeeper uses, an election is triggered in two scenarios: one, when a follower cannot contact the leader within a timeout; two, when the leader realizes it no longer has quorum support, which it notices when it tries to commit a proposal.

Back to your scenario: when a node in {C, D, E} realizes the leader is gone, it triggers an election. Once a new leader is elected, that side starts serving client requests normally. The other partition, {A, B}, won't serve any client requests and stays stuck in leader election until the partition is resolved. When it is resolved, one final election is held and the whole cluster functions normally again.
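
To make the majority arithmetic concrete, here is a tiny self-contained illustration (not ZooKeeper source; ZooKeeper's own majority check lives in its QuorumMaj class):

    // Illustration of the majority rule discussed above: strictly more than
    // half of the full ensemble must be reachable to form a quorum.
    public class MajorityQuorum {
        static boolean hasQuorum(int ensembleSize, int reachableVoters) {
            return reachableVoters > ensembleSize / 2;
        }

        public static void main(String[] args) {
            System.out.println(hasQuorum(5, 3)); // true  -> the {C,D,E} side can elect a leader
            System.out.println(hasQuorum(5, 2)); // false -> the {A,B} side stays stuck in election
        }
    }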

Metalloid answered 19/11, 2019 at 15:32 Comment(0)
