Why is AWS MSK Kafka broker constantly disconnecting and reconnecting the consumer group

I have an AWS MSK Kafka cluster with 2 brokers. From the logs I can see (on each broker) that they are constantly rebalancing. Every minute I see entries like this in the logs:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350887 (__consumer_offsets-21) (reason: Adding new member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 with group instance id None) (kafka.coordinator.group.GroupCoordinator)

And 25 seconds later:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350888 (__consumer_offsets-21) (reason: removing member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)

Why does this happen? What is causing it? And what is the amazon.msk.canary.group.broker-1 consumer group?
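For reference, here is a minimal sketch (assuming the standard Kafka Java AdminClient; the bootstrap endpoint below is hypothetical) of how the group's membership can be polled to watch the churn:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class CanaryGroupWatcher {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical bootstrap endpoint -- replace with your MSK broker string.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "b-1.example.kafka.eu-west-1.amazonaws.com:9092");

        String group = "amazon.msk.canary.group.broker-1";
        try (AdminClient admin = AdminClient.create(props)) {
            // Poll the group description a few times to see members joining and leaving.
            for (int i = 0; i < 12; i++) {
                ConsumerGroupDescription desc = admin
                        .describeConsumerGroups(Collections.singletonList(group))
                        .all().get().get(group);
                System.out.printf("state=%s members=%d%n", desc.state(), desc.members().size());
                Thread.sleep(10_000);
            }
        }
    }
}

Running something like this against the cluster should show the member joining and leaving in step with the broker log entries above.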

Mani answered 12/10, 2021 at 9:12 Comment(4)
Sorry, I can't help you with why it is rebalancing and dropping, but the canary consumer group is what AWS uses to monitor health and metrics on the Kafka cluster: docs.aws.amazon.com/msk/latest/developerguide/… Are there no other server logs? Kafka has two logs by default: the message/topic logs and the internal application logs. I believe the default internal application logs are located at $KAFKA_HOME/logs/. You might have to do some digging around to find the right logs.Crenulation
I'm experiencing the same behaviour on my MSK clusters. Did you end up finding the cause?Lepanto
Did you find any explanation for this? We are experiencing the same symptom in a cluster of 3 m5.large brokers running Kafka 2.8.1. The cluster will run fine for a few days, then occasionally the amazon.msk.canary.group.broker-N groups for all 3 brokers go into a tight rebalance loop. The only solution is to do a rolling restart of all brokers in the cluster.Candancecandela
We have the same issue and the response from AWS was "This from internal consumer groups managed by MSK. Amazon MSK creates and uses the following internal topics: __amazon_msk_canary and __amazon_msk_canary_state for cluster health and diagnostic metrics. Consumer groups (amazon.msk.canary*) shown in the logs are MSK's internal, therefore you do not need to be worried about them."Valuate

Might it be something with the configuration of Java's garbage collection on the brokers? I remember reading that a misconfigured garbage collector can cause the broker to pause for a few seconds and lose connectivity to ZooKeeper, hence the flapping behavior. Could you check whether you are applying any custom configuration for garbage collection (e.g. via the KAFKA_JVM_PERFORMANCE_OPTS environment variable)?
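I can't test this against MSK (the broker JVM isn't directly reachable there), but as a rough illustration of the kind of check I mean, here is a sketch that reads the cumulative GC counters of a self-managed broker over remote JMX. The broker-host:9999 endpoint is hypothetical and assumes remote JMX is enabled on the broker:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerGcProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint of a self-managed broker.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Enumerate the broker's garbage collectors and print their cumulative stats.
            for (ObjectName name : conn.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                        conn, name.getCanonicalName(), GarbageCollectorMXBean.class);
                System.out.printf("%s: %d collections, %d ms total collection time%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }
}

If the total collection time jumps by several seconds right before one of the disconnects, the GC theory would be worth pursuing.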

Moderato answered 3/11, 2021 at 13:29 Comment(1)
We have no control over such variables at all. All we can do is set some Kafka configuration for MSK or increase the number of brokers.Mani
