Separate ZooKeeper install or not when using Kafka 0.10.2?

I would like to use the embedded ZooKeeper 3.4.9 that comes with Kafka 0.10.2 instead of installing ZooKeeper separately. Each Kafka broker will always have its own ZooKeeper on localhost (1:1).

So if I have 5 brokers on hosts A, B, C, D and E, each running a single Kafka instance and a single ZooKeeper instance, is it sufficient to just run the ZooKeeper provided with Kafka?

What downsides or configuration limitations, if any, does the embedded ZooKeeper 3.4.9 have compared to the standalone version?

Welby answered 13/7, 2017 at 20:53 Comment(1)
Hello @redgiant, I intend to run Kafka and ZooKeeper on the same box and would like to know whether you have faced any issues. Also, do you use a supervisory process to manage your ZooKeeper? Thanks. - Idolatrize

Here are a few reasons not to run ZooKeeper on the same box as the Kafka brokers.

  1. They scale differently

    5 ZooKeeper nodes alongside 5 Kafka brokers works, but 6:6 or 11:11 does not. You don't need more than 5 ZooKeeper nodes even for a fairly large Kafka cluster. Unlike Kafka, ZooKeeper replicates its data to every node, so writes get slower as you add more nodes.

  2. They compete for disk I/O

    ZooKeeper is very sensitive to disk I/O latency. Keep it on a separate physical disk from the Kafka commit log; otherwise heavy publishing to Kafka can slow ZooKeeper down enough that it drops out of the ensemble, which can cause problems.

  3. They compete for page cache memory

    Kafka relies on the Linux page cache to reduce disk I/O. Other applications running on the same box shrink or "pollute" the page cache with their own data, leaving less cache for Kafka.

  4. Server failures take down more infrastructure

    If the box reboots, you lose both a ZooKeeper node and a broker at the same time.
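
To put numbers on point 1: ZooKeeper needs a strict majority of its servers up to keep serving, so fault tolerance is simple arithmetic. A small sketch (the function names are mine, purely for illustration):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 5, 6, 11):
    print(f"{n} servers: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failures")
```

A sixth server tolerates no more failures than five do (two in both cases), while every write has to be acknowledged by a larger quorum. That is why a 6:6 layout buys nothing over 5:5.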

Simonetta answered 14/7, 2017 at 21:36 Comment(1)
Thanks, I agree that keeping the ratio at 1:1 stops being useful beyond 5 nodes. I do have 2 separate SSDs, and I put the Kafka and ZooKeeper data/log/partition directories on different ones. The VMs are beefy with 32G RAM and so on, so I should be good with our 300m/day throughput. I will continue with my 5/5 embedded setup for now, and also continue prototyping a 5/3 separate setup. - Welby

Even though ZooKeeper ships with each Kafka release, that does not mean they should run on the same server. In fact, the advice for production environments is to run them on separate servers.

In the Kafka broker configuration you can specify the ZooKeeper address, which can be local or remote. This is from the broker config (config/server.properties):

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181

You can replace localhost with any other accessible server name or IP address.
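
For a multi-node ensemble the value is just a comma-separated list, with the optional chroot appended once at the very end rather than after each host. A minimal sketch of building it; the host names, the `/kafka` chroot and the helper name are assumptions for illustration only:

```python
def build_zookeeper_connect(hosts, port=2181, chroot=""):
    """Join hosts into a zookeeper.connect value.

    The chroot, if given, is appended once after the last host:port pair.
    """
    if chroot and not chroot.startswith("/"):
        raise ValueError("chroot must start with '/'")
    servers = ",".join(f"{host}:{port}" for host in hosts)
    return servers + chroot


# Hypothetical hosts; substitute the names of your ZooKeeper servers.
value = build_zookeeper_connect(["hostA", "hostB", "hostC"], chroot="/kafka")
print("zookeeper.connect=" + value)
```

Every broker in the cluster should be given the same connection string, whether the ZooKeeper servers are co-located with the brokers or not.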

Kermanshah answered 13/7, 2017 at 21:5 Comment(3)
Right, I know how to configure the scenarios that are possible. My question is specifically about why you would not co-locate them. The only reasons I could think of are shared disks and shared OS file handle limits. - Welby
This post gives a good summary: grokbase.com/t/kafka/users/144rzmzp0w/…. Plus, 1) if there are different applications (other than Kafka) that rely on ZooKeeper, it seems reasonable not to have a Kafka broker on the ZooKeeper nodes. 2) a server failure would affect one of the two, not both. - Kermanshah
Thanks, I will stay with my 5/5 embedded ZooKeeper setup for now, and continue looking into a 5/3 split setup for the future. - Welby

We've been running a setup like the one you describe, with 3 to 5 nodes, each running a Kafka broker and the ZooKeeper that comes with the Kafka distribution. No issues with that setup so far, but our data throughput isn't high.

If we were to scale above 5 nodes we'd separate them, so that we scale only the Kafka brokers and keep the ZooKeeper ensemble small. If ZooKeeper and Kafka start competing for I/O too much, we'd move their data directories to separate drives. If they start competing for CPU, we'd move them to separate boxes.
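
If you do split the data directories (Kafka's `log.dirs`, ZooKeeper's `dataDir` and `dataLogDir`), a quick sanity check is to compare the device IDs of the two paths. This is only a rough sketch: the paths are hypothetical, and a different device ID means a different filesystem, not necessarily a different physical disk (two partitions on one drive would also pass).

```python
import os


def on_separate_devices(path_a: str, path_b: str) -> bool:
    """True when the two paths live on different filesystems (different st_dev)."""
    return os.stat(path_a).st_dev != os.stat(path_b).st_dev


if __name__ == "__main__":
    # Hypothetical locations; substitute your own log.dirs and dataDir values.
    kafka_dir = "/var/lib/kafka"
    zk_dir = "/var/lib/zookeeper"
    if os.path.exists(kafka_dir) and os.path.exists(zk_dir):
        print("separate devices:", on_separate_devices(kafka_dir, zk_dir))
```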

All in all, it depends on your expected throughput and on how easily you can change your setup if contention starts to appear. You can start small and easy, with Kafka and ZooKeeper co-located, as long as you have the flexibility to add nodes and introduce separation later on. If you think that will be hard to do later, it is better to run them separately from the start. We've been running them co-located for 18+ months and haven't encountered resource contention so far.

Lenalenard answered 16/7, 2017 at 11:32 Comment(1)
Yes, that is what I am doing. I use Ansible to install and configure everything from the base OS VM up, so I could change from a 5/5 setup to a 5/3 split setup easily later if desired. - Welby
