Explaining Apache ZooKeeper

Asked 7/9, 2010 at 21:43 Answered 26/12, 2021 at 13:2

Solved apache-zookeeper distributed-computing

415

I am trying to understand ZooKeeper, how it works and what it does. Is there any application which is comparable to ZooKeeper?

If you know, then how would you describe ZooKeeper to a layman?

I have tried apache wiki, zookeeper sourceforge...but I am still not able to relate to it.

I just read thru http://zookeeper.sourceforge.net/index.sf.shtml, so aren't there more services like this? Is it as simple as just replicating a server service?

Lauter answered 7/9, 2010 at 21:43 Comment(8)

Similar to but not the exact answer you are looking for: https://mcmap.net/q/87374/-real-world-use-of-zookeeper-closed – Akihito 7/9, 2010 at 22:29

cloudera.com/blog/2009/05/… – Akihito 11/10, 2010 at 1:26

You can read this paper ZooKeeper: Wait-free coordination for Internet-scale systems Written by two Yahoo! engineers – Scrogan 7/9, 2013 at 16:51

Here is a tech talk that is an introduction to Apache ZooKeeper by Camille Fournier who is the CTO of RentTheRunway. I hope it is helpful. – Ceto 20/11, 2013 at 19:31

@Luca Geretti...Acoording to me, Zookeper provides set of apis so that we can make use of it to coordinate the distributed application. correct me if I am wrong. – Ruralize 12/8, 2014 at 7:22

~58:00 minute video. – Mezuzah 1/12, 2017 at 21:35

Check this article : stackextend.com/zookeeper/… – Carleycarli 16/1, 2018 at 14:38

this 22 min video explains very well - youtube.com/watch?v=WlkqeSstV3c – Curtcurtail 27/7, 2019 at 6:16

464

In a nutshell, ZooKeeper helps you build distributed applications.

How it works

You may describe ZooKeeper as a replicated synchronization service with eventual consistency. It is robust, since the persisted data is distributed between multiple nodes (this set of nodes is called an "ensemble") and one client connects to any of them (i.e., a specific "server"), migrating if one node fails; as long as a strict majority of nodes are working, the ensemble of ZooKeeper nodes is alive. In particular, a master node is dynamically chosen by consensus within the ensemble; if the master node fails, the role of master migrates to another node.

How writes are handled

The master is the authority for writes: in this way writes can be guaranteed to be persisted in-order, i.e., writes are linear. Each time a client writes to the ensemble, a majority of nodes persist the information: these nodes include the server for the client, and obviously the master. This means that each write makes the server up-to-date with the master. It also means, however, that you cannot have concurrent writes.

The guarantee of linear writes is the reason for the fact that ZooKeeper does not perform well for write-dominant workloads. In particular, it should not be used for interchange of large data, such as media. As long as your communication involves shared data, ZooKeeper helps you. When data could be written concurrently, ZooKeeper actually gets in the way, because it imposes a strict ordering of operations even if not strictly necessary from the perspective of the writers. Its ideal use is for coordination, where messages are exchanged between the clients.

How reads are handled

This is where ZooKeeper excels: reads are concurrent since they are served by the specific server that the client connects to. However, this is also the reason for the eventual consistency: the "view" of a client may be outdated, since the master updates the corresponding server with a bounded but undefined delay.

In detail

The replicated database of ZooKeeper comprises a tree of znodes, which are entities roughly representing file system nodes (think of them as directories). Each znode may be enriched by a byte array, which stores data. Also, each znode may have other znodes under it, practically forming an internal directory system.

Sequential znodes

Interestingly, the name of a znode can be sequential, meaning that the name the client provides when creating the znode is only a prefix: the full name is also given by a sequential number chosen by the ensemble. This is useful, for example, for synchronization purposes: if multiple clients want to get a lock on a resource, they can each concurrently create a sequential znode on a location: whoever gets the lowest number is entitled to the lock.

Ephemeral znodes

Also, a znode may be ephemeral: this means that it is destroyed as soon as the client that created it disconnects. This is mainly useful in order to know when a client fails, which may be relevant when the client itself has responsibilities that should be taken by a new client. Taking the example of the lock, as soon as the client having the lock disconnects, the other clients can check whether they are entitled to the lock.

Watches

The example related to client disconnection may be problematic if we needed to periodically poll the state of znodes. Fortunately, ZooKeeper offers an event system where a watch can be set on a znode. These watches may be set to trigger an event if the znode is specifically changed or removed or new children are created under it. This is clearly useful in combination with the sequential and ephemeral options for znodes.

Where and how to use it

A canonical example of Zookeeper usage is distributed-memory computation, where some data is shared between client nodes and must be accessed/updated in a very careful way to account for synchronization.

ZooKeeper offers the library to construct your synchronization primitives, while the ability to run a distributed server avoids the single-point-of-failure issue you have when using a centralized (broker-like) message repository.

ZooKeeper is feature-light, meaning that mechanisms such as leader election, locks, barriers, etc. are not already present, but can be written above the ZooKeeper primitives. If the C/Java API is too unwieldy for your purposes, you should rely on libraries built on ZooKeeper such as cages and especially curator.

Where to read more

Official documentation apart, which is pretty good, I suggest to read Chapter 14 of Hadoop: The Definitive Guide which has ~35 pages explaining essentially what ZooKeeper does, followed by an example of a configuration service.

Demimondaine answered 14/1, 2012 at 18:19 Comment(7)

Can I use Zookeeper as a way of communicating data between servers? Specially for a game where there are a few servers doing specific task but need to communicate to other servers like GameServer and LoginServer. – Dreda 26/6, 2014 at 3:33

I'm not sure I understand the communication scheme you are suggesting, but you can use ZooKeeper to "publish" information from a producer and have several consumers read it. If on the other hand there exists only one instance of each kind of server, there is little benefit in using ZK. – Demimondaine 27/6, 2014 at 9:59

IMO this fails to explain what ZooKeeper is to a layperson. When would I need ZooKeeper? What would I write to it? What problem does it solve? Is it a key-value store? A search engine? A distributed lock? Why would I pick ZooKeeper over e.g. Redis or a file or JIRA or post-it notes? You clearly know a lot about ZooKeeper - but can you explain it less technically? – Caco 10/1, 2017 at 21:14

As Zookeeper has linear writes, that does not stop me to use Asynchronous APIs to create nodes and take the response in a callback ? Though internally it may not allow concurrent writes , or am I missing something ? – Spelldown 15/6, 2017 at 14:26

"Each time a client writes to the ensemble, a majority of nodes persist the information: these nodes include the server for the client, and obviously the master" => could you please point me to a doc. or something where this is explained? I'm wondering whether it is possible that a state change was successfully made excluding the server to which the client is connected (in which case, the client can experience the strange behavior of not being able to read its own write for a moment) – Dhiman 22/2, 2018 at 9:21

Completely and utterly antithetical to the question asked. If it was a clock, he would be looking for "time keeping device" not a description of the mainspring, wheel train, escapement and their interaction based on the period of oscillation, moment of inertia and the impact of artificial sapphire crystals. – Infare 4/10, 2019 at 4:51

@LucaGeretti , Do you think that Apache Zookeeper can be used for executing the consensus as an external system as it is explained in the following question? https://mcmap.net/q/87375/-performing-consensus-of-a-p2p-network-outside-the-system-in-another-system-which-runs-the-consensus-by-real-machines-and-returns-the-results/5029509 – Damask 30/11, 2021 at 15:24

Zookeeper is one of the best open source server and service that helps to reliably coordinates distributed processes. Zookeeper is a CP system (Refer CAP Theorem) that provides Consistency and Partition tolerance. Replication of Zookeeper state across all the nodes makes it an eventually consistent distributed service.

Moreover, any newly elected leader will update its followers with missing proposals or with a snapshot of the state, if the followers have many proposals missing.

Zookeeper also provides an API that is very easy to use. This blog post, Zookeeper Java API examples, has some examples if you are looking for examples.

So where do we use this? If your distributed service needs a centralized, reliable and consistent configuration management, locks, queues etc, you will find Zookeeper a reliable choice.

Terceira answered 30/1, 2016 at 21:55 Comment(3)

"Zookeeper is a CP system (Refer CAP Theorem) that provides Consistency and Partition tolerance", I think that Zookeeper have master and followers, when the master down, then one of the follower would be elected as the Leader, so Zookeeper should provide the AP, however the C is eventually consistently. – Chaplain 18/12, 2017 at 6:4

In terms of CAP theorem, "C" actually means linearizability. ZooKeeper in fact provides "sequential consistency" and it means updates from clients will be applied in the order that they were recieved.. This is weaker than linearizability but is still very strong, much stronger than "eventual consistency". Zookeeper is not A and this is because If leader cannot be elected (no quorum) then zookeeper will fail requests. This is why it is not highly available. – Terceira 3/1, 2019 at 19:50

Do you think that Apache Zookeeper can be used for executing the consensus as an external system as it is explained in the following question? https://mcmap.net/q/87375/-performing-consensus-of-a-p2p-network-outside-the-system-in-another-system-which-runs-the-consensus-by-real-machines-and-returns-the-results/5029509 – Damask 30/11, 2021 at 15:24

What problem does it solve?

Let's imagine we have a million files in a file store and the file count keeps increasing every minute of the day. Our task is to first process and then delete these files. One of the approach we can think of is to write a script that does this task and run multiple instances parallelly on multiple servers. We can even increase or decrease the server count based on the demand. This is basically a distributed compute/data processing application.

Here, how can we ensure that the same file is not picked and processed by multiple servers at the same time? To solve this problem, all the servers should share the information b/w them regarding which file is currently being processed.

This is where we can use something like ZooKeeper. When the first server wants to read a file, it can write to the zookeeper the file name its going to process. Now the rest of the servers can look up ZooKeeper and know that this file is already picked up by the first server.

Above is a crude example and needs few other guard rails in place but I hope it gives an idea on what zookeeper is. ZK is basically a data store which can be accessed using the ZK API's. But it should NOT be used as a database. Only a small amount of data should be stored(usually in KB's). The upper limit is 1MB per znode. ZK is specifically built so that the distributed applications can communicate among each other.

Applications of ZK

Out of the box can be used for

storing configuration: to store configuration that is accessed across your distributed application.
naming service: store information such as service name and IP address mapping in a central place, which enables users and applications to communicate across the network.
group membership: all the applications running on distributed servers can connect to ZK and send heartbeats. If any one server/application goes down then ZK can alert other servers/applications regarding this event.

Other features have to be built on top of the ZooKeeper API.

locks and queues - useful for distributed synchronization.
two phase commits - useful when we have to commit/rollback across servers.
leader election - your distributed applications can use ZK to hold leader elections for automatic failovers.
shared counter

Below is the page that explains how these features can be implemented https://zookeeper.apache.org/doc/current/recipes.html

ZooKeeper can have many more applications. The features have to be built on top of ZK API's based on the requirements of your distributed system.

NOTE: ZK should not be used to store large amounts of data. Its not a cache/database. Use it to exchange small piece of information that your distributed applications need to start, operate and failover.

How data is stored?

Data is stored in a hierarchical tree data structure. Each node in the tree is called znode. Max size of a znode is 1MB. znodes can have data and other children znodes. Think of a znode like a folder on your computer where the folder can have files with data but also the folder itself can have data just like a file.

Why use ZK instead of our own custom service?

Atomicity and Durability
Zookeeper itself is distributed and Fault tolerant. The architecture involves one leader node and multiple follower nodes. In case a ZK follower node goes down, it will automatically failover. The client sessions are replicated hence ZK can automatically move clients to a different node. If the Leader node goes down then a new leader is elected using the ZK consensus algorithm.
Reads are very fast since its served from in-memory store.
Writes are written in the sequence in which it arrived. Hence maintains ordering.
Watches will send out notification to the client who set the watch on some data. This reduces the need to poll ZK. Note that watches are one time triggers and if you get a watch event and you want to get notified of future changes, you must set another watch.
Persistent and ephemeral znodes are available. Both are stored on ZK disks. Persistent here means that the data will be persisted once the client who created it disconnects. Ephemeral means the data will be removed automatically when the client disconnects. Ephemeral znodes are not allowed to have children.
There is also persistent sequential and ephemeral sequential znodes. Here the names of the znodes can have a suffix sequence number. similar to DB auto increment ID's, these sequence number keeps increasing and managed by ZK. This can be useful to implement queues, locks etc.

Is there any application which is comparable to Zookeeper?

etcd - https://etcd.io/docs/v3.3/learning/why/#zookeeper

Fredel answered 26/12, 2021 at 13:2 Comment(0)

I understand the ZooKeeper in general but had problems with the terms "quorum" and "split brain" so maybe I can share my findings with you (I consider myself also a layman).

Let's say we have a ZooKeeper cluster of 5 servers. One of the servers will become the leader and the others will become followers.

These 5 servers form a quorum. Quorum simply means "these servers can vote upon who should be the leader".
So the voting is based on majority. Majority simply means "more than half" so more than half of the number of servers must agree for a specific server to become the leader.
So there is this bad thing that may happen called "split brain". A split brain is simply this, as far as I understand: The cluster of 5 servers splits into two parts, or let's call it "server teams", with maybe one part of 2 and the other of 3 servers. This is really a bad situation as if both "server teams" must execute a specific order how would you decide wich team should be preferred? They might have received different information from the clients. So it is really important to know what "server team" is still relevant and which one can/should be ignored.
Majority is also the reason you should use an odd number of servers. If you have 4 servers and a split brain where 2 servers seperate then both "server teams" could say "hey, we want to decide who is the leader!" but how should you decide which 2 servers you should choose? With 5 servers it's simple: The server team with 3 servers has the majority and is allowed to select the new leader.
Even if you just have 3 servers and one of them fails the other 2 still form the majority and can agree that one of them will become the new leader.

I realize once you think about it some time and understand the terms it's not so complicated anymore. I hope this also helps anyone in understanding these terms.

Wieche answered 11/8, 2017 at 9:1 Comment(1)

Zookeeper is a centralized open-source server for maintaining and managing configuration information, naming conventions and synchronization for distributed cluster environment. Zookeeper helps the distributed systems to reduce their management complexity by providing low latency and high availability. Zookeeper was initially a sub-project for Hadoop but now it's a top level independent project of Apache Software Foundation.

More Information

Hamachi answered 20/8, 2015 at 11:34 Comment(2)

What makes you say that zookeeper is centralized? Zookeeper can and should be run distributed. – Retrospection 19/2, 2017 at 19:31

I would suggest the following resources:

The paper: https://pdos.csail.mit.edu/6.824/papers/zookeeper.pdf
The lecture offered by MIT 6.824 from 36:00: https://youtu.be/pbmyrNjzdDk?t=2198

I would suggest watching the video, read the paper, and then watch the video again. It would be easier to understand if you know Raft beforehand.

Scruple answered 25/3, 2020 at 16:49 Comment(0)

My approach to understand zookeeper was, to play around with the CLI client. as described in Getting Started Guide and Command line interface

From this I learned that zookeeper's surface looks very similar to a filesystem and clients can create and delete objects and read or write data.

Example CLI commands

create /myfirstnode mydata
ls /
get /myfirstnode
delete /myfirstnode

Try yourself

How to spin up a zookeper environment within minutes on docker for windows, linux or mac:

One time set up:

docker network create dn

Run server in a terminal window:

docker run --network dn --name zook -d zookeeper
docker logs -f zookeeper

Run client in a second terminal window:

docker run -it --rm --network dn zookeeper zkCli.sh -server zook

Sunken answered 27/12, 2020 at 21:0 Comment(1)

Apache ZooKeeper is an open source technology for coordinating and managing configuration in distributed applications. It simplifies tasks like maintaining configuration details, enabling distributed synchronization, and managing naming registries.

It's aptly named - think about how a zookeeper goes around and takes care of all the animals, maintains their pens, feeds them, etc.

Apache ZooKeeper can be used with Apache projects like Apache Pinot or Apache Flink. Apache Kafka also uses ZooKeeper for managing brokers, topics, and partition info. Since Apache ZooKeeper open source, you can also pair it with any technology/project of your choosing, not just Apache Foundation projects.

Selfsufficient answered 15/10, 2021 at 17:7 Comment(1)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++