Why does ZooKeeper need a majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of three machines: A, B, and C.

When A fails, a new leader is elected - fine, everything works. When another one dies, let's say B, the service becomes unavailable. Does that make sense? Why can't machine C handle everything alone until A and B are back up?

After all, one machine is enough to do all the work (for example, a single-machine ensemble works fine)...

Is there a particular reason why ZooKeeper is designed this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is available whenever at least one of N nodes is up?

Edit: Maybe there is a way to plug in a custom leader-election algorithm? Or to define the size of the quorum?

Thanks in advance.

Tulley answered 17/4, 2013 at 13:0 Comment(2)
Have you found a way to alter the current leader-election algorithm? I also found it a bit frustrating to pay 200K for 3 machines that were supposed to eliminate the single point of failure, only to find myself in basically the same position ...Ias
@Ias No, I haven't even tried to find a workaround, because after reading the answers to my question I understood ZooKeeper's approach, and it seems very logical.Hustler

ZooKeeper is intended to distribute things reliably. If the network of systems becomes segmented, you don't want the two halves operating independently and potentially drifting out of sync, because when the failure is resolved they won't know how to reconcile. If you have ZooKeeper refuse to operate with less than a majority, you can be assured that when a failure is resolved, everything will come right back up without further intervention.
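The rule described above can be sketched in a few lines. This is a minimal illustration of the majority principle, not ZooKeeper's actual implementation: a partition may keep serving only if it holds a strict majority of the full ensemble, and since two disjoint partitions cannot both exceed half, at most one side can ever make progress.

```python
def can_make_progress(partition_size, ensemble_size):
    # A partition may keep serving only if it holds a strict majority
    # of the full ensemble. Two disjoint partitions cannot both do so.
    return partition_size > ensemble_size // 2

# 5-node ensemble split 3/2: only the 3-node side keeps serving.
print(can_make_progress(3, 5))  # True
print(can_make_progress(2, 5))  # False

# A clean 2/2 split of a 4-node ensemble halts both sides - no split-brain.
print(can_make_progress(2, 4))  # False
```

Note that this is also why ensembles of 2N+1 and 2N+2 nodes tolerate the same number of failures (N), so odd-sized ensembles are the norm.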

Mixtec answered 17/4, 2013 at 13:4 Comment(3)
In my example there is no segmentation of the system - only two of the 3 nodes are down. Killing nodes doesn't split the system into 2/3/n smaller ones; it is still the same system, just with fewer nodes. The system should be down only when all nodes are. Even a single node should be able to handle clients' read/write requests (and, in the background, keep trying to reconnect to the other nodes). A ZooKeeper service with 50% of its nodes working will be down - does that make sense? If we had 100 machines, we would still have 50 nodes to handle everything... but the service is unavailable.Hustler
Replace the word "kill" with "isolate" and you'll see that in the case of a network failure (with the machine still running) a node that finds itself outside the majority is logically offline. Network failure is a more common real-world scenario compared to machine shutdown.Mazdaism
@MichałSzkudlarek - The thing you aren't realizing is that the ZooKeeper nodes DON'T KNOW whether the missing nodes are down completely or inaccessible due to network failure. So they behave as if it's a network failure, in order to avoid problems when whatever is wrong gets fixed.Mixtec

The reason to require a majority vote is to avoid a problem called "split-brain".

Basically, during a network failure you don't want the two parts of the system to continue as usual. You want one part to continue and the other to understand that it is no longer part of the cluster.

There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock you are part of the cluster, and if you can't you're out. If you hold the lock you're the leader; if you don't, you're not. The problem with this approach is that you need that shared resource.

The other way to prevent split-brain is a majority count: if you get enough votes, you are the leader. This still works with two nodes out of an ensemble of three, where the leader says it is the leader and the other node, acting as a "witness", agrees. This method is preferable because it works in a shared-nothing architecture, and indeed that is what ZooKeeper uses.

As Michael mentioned, a node cannot know whether it fails to see the other nodes in the cluster because they are down or because of a network problem - the safe bet is to assume there is no quorum.

Ovoid answered 19/4, 2013 at 19:37 Comment(0)

Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.

Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.

This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
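The arithmetic behind this example can be checked directly. The sketch below (an illustration, not ZooKeeper code) verifies the intersection property: with quorums of size 2 out of 5 servers, two quorums can be disjoint (e.g. {s1, s2} and {s3, s4}), so an acknowledged write can be invisible to a later quorum; with majority quorums of size 3, any two quorums share at least one server, so some server in the new quorum has seen /z.

```python
from itertools import combinations

def every_pair_intersects(n, q):
    """True if every two q-server quorums out of n servers share a server."""
    quorums = [set(c) for c in combinations(range(n), q)]
    return all(a & b for a, b in combinations(quorums, 2))

# Quorums of 2 out of 5 can be disjoint: an update seen only by one
# quorum can be missed entirely by the next - the lost /z above.
print(every_pair_intersects(5, 2))  # False

# Majority quorums (3 of 5) always overlap, so state survives.
print(every_pair_intersects(5, 3))  # True
```

In general the smallest quorum size guaranteeing overlap is floor(n/2) + 1, which is exactly the majority that ZooKeeper requires.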

Hydra answered 6/7, 2017 at 10:23 Comment(0)
