Why is AWS MSK Kafka broker constantly disconnecting and reconnecting the consumer group

I have an AWS MSK Kafka cluster with 2 brokers. From the logs I can see (on each broker) that they are constantly rebalancing. Every minute I see entries like this in the logs:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350887 (__consumer_offsets-21) (reason: Adding new member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 with group instance id None) (kafka.coordinator.group.GroupCoordinator)

And 25 seconds later:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350888 (__consumer_offsets-21) (reason: removing member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)

Why does this happen? What is causing it? And what is the amazon.msk.canary.group.broker-1 consumer group?
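For reference, here is a minimal sketch (assuming the standard Kafka Java AdminClient; the bootstrap endpoint below is hypothetical) of how the group's membership can be polled to watch the churn:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class CanaryGroupWatcher {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical bootstrap endpoint -- replace with your MSK broker string.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "b-1.example.kafka.eu-west-1.amazonaws.com:9092");

        String group = "amazon.msk.canary.group.broker-1";
        try (AdminClient admin = AdminClient.create(props)) {
            // Poll the group description a few times to see members joining and leaving.
            for (int i = 0; i < 12; i++) {
                ConsumerGroupDescription desc = admin
                        .describeConsumerGroups(Collections.singletonList(group))
                        .all().get().get(group);
                System.out.printf("state=%s members=%d%n", desc.state(), desc.members().size());
                Thread.sleep(10_000);
            }
        }
    }
}

Running something like this against the cluster should show the member joining and leaving in step with the broker log entries above.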

Mani answered 12/10, 2021 at 9:12 Comment(4)
Sorry, I can't help you with why it is rebalancing and dropping, but the canary consumer group is what AWS uses to monitor health and metrics on the Kafka cluster: docs.aws.amazon.com/msk/latest/developerguide/… Are there no other server logs? Kafka has two logs by default: the message/topic logs and the internal application logs. I believe the default internal application logs are located at $KAFKA_HOME/logs/. You might have to do some digging around to find the right logs.Crenulation
I'm experiencing the same behaviour on my MSK clusters. Did you end up finding the cause?Lepanto
Did you find any explanation for this? We are experiencing the same symptom in a cluster of 3 m5.large brokers running Kafka 2.8.1. The cluster will run fine for a few days, then occasionally the amazon.msk.canary.group.broker-N groups for all 3 brokers go into a tight rebalance loop. The only solution is to do a rolling restart of all brokers in the cluster.Candancecandela
We have the same issue and the response from AWS was "This from internal consumer groups managed by MSK. Amazon MSK creates and uses the following internal topics: __amazon_msk_canary and __amazon_msk_canary_state for cluster health and diagnostic metrics. Consumer groups (amazon.msk.canary*) shown in the logs are MSK's internal, therefore you do not need to be worried about them."Valuate

Might it be something with the configuration of Java's garbage collection on the brokers? I remember reading that a misconfigured garbage collector can cause the broker to pause for a few seconds and lose connectivity to ZooKeeper, hence the flapping behavior. Could you check whether you are applying any custom configuration for garbage collection (e.g. via the KAFKA_JVM_PERFORMANCE_OPTS environment variable)?
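I can't test this against MSK (the broker JVM isn't directly reachable there), but as a rough illustration of the kind of check I mean, here is a sketch that reads the cumulative GC counters of a self-managed broker over remote JMX. The broker-host:9999 endpoint is hypothetical and assumes remote JMX is enabled on the broker:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerGcProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint of a self-managed broker.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Enumerate the broker's garbage collectors and print their cumulative stats.
            for (ObjectName name : conn.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                        conn, name.getCanonicalName(), GarbageCollectorMXBean.class);
                System.out.printf("%s: %d collections, %d ms total collection time%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }
}

If the total collection time jumps by several seconds right before one of the disconnects, the GC theory would be worth pursuing.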

Moderato answered 3/11, 2021 at 13:29 Comment(1)
We have no control over such variables at all. All we can do is set some Kafka configuration for MSK or increase the number of brokers.Mani
