I am building an Alert Monitoring tool for Kafka.
I understand that for some metrics the thresholds depend on application data. However, I am only interested in the metrics and threshold values that will help me detect lag and determine whether any scaling is required.
So far I can do the following:
- Enable JMX on the Kafka broker.
- Fetch JMX metrics using a JMX Java client or JConsole.
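
For illustration, reading a single broker metric with the standard `javax.management` client API could look roughly like the sketch below. The class name `BrokerMetricReader` and its methods are hypothetical, and the JMX URL in the usage note assumes the broker was started with JMX on port 9999 on localhost. The MBean name is the one Kafka brokers expose for `UnderReplicatedPartitions`; broker gauges publish their reading under the attribute `Value`.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricReader {

    // MBean that Kafka brokers register for the under-replicated partition gauge.
    static final String UNDER_REPLICATED =
            "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions";

    // Reads one attribute of one MBean over an already-open connection.
    static Object readAttribute(MBeanServerConnection conn,
                                String objectName,
                                String attribute) throws Exception {
        return conn.getAttribute(new ObjectName(objectName), attribute);
    }

    // Opens a remote JMX connection, reads one attribute, and closes the connection.
    // jmxUrl is an assumption about the deployment, for example
    // "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi".
    static Object readFromBroker(String jmxUrl,
                                 String objectName,
                                 String attribute) throws Exception {
        try (JMXConnector connector =
                     JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl))) {
            return readAttribute(connector.getMBeanServerConnection(), objectName, attribute);
        }
    }
}
```

A call such as `BrokerMetricReader.readFromBroker(url, BrokerMetricReader.UNDER_REPLICATED, "Value")` would then return the current gauge value, provided the broker is reachable at that URL.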
Next, I researched this and found many metrics, but none came with complete thresholds (e.g. a fixed value, a pattern such as increasing or decreasing, or some calculation) on which I could base my alerting logic.
A few examples:
- UnderReplicatedPartitions - alert if the value is greater than 0.
- records-lag-max - alert if the value increases over time.
- OfflinePartitionsCount - alert if the value is greater than 0.
- ActiveControllerCount - alert if the value is anything other than 1.
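
The kind of logic I have in mind for the examples above can be sketched as follows. This is only an illustration under stated assumptions: the class `KafkaAlertRules`, its method names, and the `window` parameter are hypothetical, the metric values are assumed to have been collected already, and `ActiveControllerCount` is assumed to be summed across all brokers (each broker reports 0 or 1 on its own).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KafkaAlertRules {

    // Fixed-value rules: compare the latest sample of each metric to a constant.
    // Missing metrics fall back to their healthy value and raise no alert.
    static List<String> checkStatic(Map<String, Long> latest) {
        List<String> alerts = new ArrayList<>();
        long underReplicated = latest.getOrDefault("UnderReplicatedPartitions", 0L);
        long offline = latest.getOrDefault("OfflinePartitionsCount", 0L);
        // Assumption: this value is the sum over all brokers in the cluster.
        long controllers = latest.getOrDefault("ActiveControllerCount", 1L);

        if (underReplicated > 0) {
            alerts.add("UnderReplicatedPartitions is " + underReplicated);
        }
        if (offline > 0) {
            alerts.add("OfflinePartitionsCount is " + offline);
        }
        if (controllers != 1) {
            alerts.add("ActiveControllerCount is " + controllers);
        }
        return alerts;
    }

    // Trend rule for records-lag-max: true when the last `window` samples
    // are strictly increasing. Returns false if there are too few samples.
    static boolean isSteadilyIncreasing(List<Double> samples, int window) {
        if (window < 2 || samples.size() < window) {
            return false;
        }
        int start = samples.size() - window;
        for (int i = start + 1; i < samples.size(); i++) {
            if (samples.get(i) <= samples.get(i - 1)) {
                return false;
            }
        }
        return true;
    }
}
```

The fixed-value checks are straightforward. What I am unsure about is the trend rule: how many samples and what sampling interval make "increasing over time" a reliable signal for scaling.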