During rolling upgrade/restart, how to detect when a kafka broker is "done"?
Asked Answered
S

4

7

I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).

What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.

Another way would be to ask what a good "readyness" probe would be that does not simply check some TCP/IP port, but looks at the actual server...

Snook answered 18/1, 2019 at 8:6 Comment(0)
B
5

I would suggest exposing JMX metrics and tracking the following for cluster health

  • the controller count (must be 1 over the whole cluster)
  • under replicated partitions (should be zero for healthy cluster)
  • unclean leader elections (if you don't disable this in server.properties make sure there are none in the metric counts)
  • ISR shrinks within a reasonable time period, like 10 minute window (should be none)

Also, Yelp has tooling for rolling restarts implemented in Python, which requires Jolokia JMX Agents installed on the brokers, and it polls the metrics to make sure some of the above conditions are true

Betti answered 18/1, 2019 at 20:26 Comment(2)
Think that Yelps solution is only when your cluster is in non-security mode. Please correct me if I'm wrongTripitaka
I don't think the Kafka security protocol matters. The script seems to use SSH and HTTP .Betti
X
2

Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.

As the previous responders mentioned, there is existing code out there to automate this. I don’t use Jolikia, myself, but my solution (which I’m working on now) also uses JMX metrics.

Xenon answered 15/3, 2019 at 0:45 Comment(0)
P
1

Kakfa Utils by Yelp is one of the best tools that can be used to detect when a kafka broker is "done". Specifically, kafka_rolling_restart is the tool which gets broker details from zookeeper and URP (Under Replicated Partitions) metrics from each broker. When a broker is restarted, total URPs across Kafka cluster is periodically collected and when it goes to zero, it restarts another broker. The controller broker is restarted at the last.

Pre answered 5/7, 2019 at 9:41 Comment(0)
L
0

As other people said URP (under-replicated-partition) should return 0 before you restart the next broker.

If you are monitoring kafka, you need to make sure:

sum(kafka_topic_partition_under_replicated_partition) 

is equal to 0.

Longshore answered 27/9 at 10:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.