Environment
- 3-node Kafka Cluster
- Amazon MSK
- v2.3
- 1 topic
- 6 partitions
- 1 consumer group with 2 consumers
- Running in Kubernetes
- Confluent .NET SDK 1.2.2
- Except for
bootstrap.servers
andgroup.id
, all of the default settings.
Problem
First, one of my consumers encounters the following exception.
Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid
at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets)
at Confluent.Kafka.Consumer`2.Commit(IEnumerable`1 offsets)
The exception is trapped and the consumer is supposed to retry, but instead the app sits idle. The container is still up and running, but not consuming any more messages.
What's weirder is that the broker never reassigns that consumer's partitions so the consumer lag on those partitions begins to grow. It seems like the consumer is both alive (since the broker is not reassigning its partitions) and dead (since it cannot commit its offset or consume more messages). If we intervene and manually restart the consumers then the partitions are reassigned and the situation goes back to normal.
I'm not entirely sure what to make of the exception above. Google doesn't offer much. The most relevant lead I have is this issue in GitHub, which involves a broker restarting. To my knowledge, that is not happening in my situation. Any assistance would be greatly appreciated.