Consumer Stuck in Re-join

Asked 14/6, 2017 at 21:45 Answered 22/3, 2019 at 20:53

apache-kafka kafka-consumer-api message-hub

I've read other threads and I've gotten around the problem by using a new group ID, however I'd like to understand what could cause this.

I have a topic with 16 partitions, I've set session.timeout.ms=30000, and max.poll.interval.ms=30000000.

I run my program, and ctrl+c it, so it's not closing properly. After I guess, 16 times, I get stuck in this re-join issue. session.timeout.ms is the heartbeat timeout, so after 30 seconds it should kick my consumer right and my partitions should "free up" right? Or is it only listening to my max.poll.interval.ms?

EDIT: I still get this error intermittently, and when it happens i have to restart all my consumers. This happens even when my consumers were running fine and then they start all getting stuck at rejoining (no consumers were added/removed). Here's an error log from when I try to connect to it after with a new consumer when it's stuck in that state :

https://pastebin.com/AXJeSHkp

2017-06-29 17:28:16,215 DEBUG [AbstractCoordinator] - [scheduler-1] - Sending JoinGroup ((type: JoinGroupRequest, groupId=ingestion-matching-kafka-consumer-group-dev1, sessionTimeout=30000, rebalanceTimeout=43200000, memberId=, protocolType=consumer, groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@b45e5583)) to coordinator kafka04-prod01.messagehub.services.us-south.bluemix.net:9093 (id: 2147483644 rack: null)

2017-06-29 17:37:21,261 DEBUG [NetworkClient] - [scheduler-1] - Node 2147483644 disconnected.
2017-06-29 17:37:21,263 DEBUG [ConsumerNetworkClient] - [scheduler-1] - Cancelled JOIN_GROUP request {api_key=11,api_version=1,correlation_id=19,client_id=ingestion-matching-kafka-consumer-dev1} with correlation id 19 due to node 2147483644 being disconnected

Those are the first and last messages I think are relevant. Here are the relevant timeouts I've set:

session.timeout.ms=30000
max.poll.interval.ms=43200000    
request.timeout.ms=43205000 # the docs said to keep this higher than max.poll.interval.ms
enable.auto.commit=false

Should I set heartbeat.interval.ms too? Is this the interval that heartbeats are sent by the consumer to the broker automatically in some background thread (I have read the docs but for some reason I can't quite wrap my head around it)?

Lament answered 14/6, 2017 at 21:45 Comment(0)

I know it's a quite old question but I had similar issue and finally I understood the reason of this situation and want to share.

When rebalance starts Kafka waits all consumers in the group to poll() and send joinGroup request. Rebalance timeout is equal to max.poll.interval.ms. So Kafka waits until rebalance timeout or end of the process for each consumer.

In your case you set max.poll.interval.ms to 12 hours. Only sensible reason to that you must have a long process. So when rebalance starts Kafka will wait until your process is finished or 12 hours is passed. That's why your consumer seems stuck.

Hardtack answered 22/3, 2019 at 20:53 Comment(0)

If your client does not disconnect properly (crash or SIGINT), it will take session.timeout.ms (30 seconds in your case) for the server to kick it from the group. During this time, the server will still think the consumer is part of the group, so it will not do any reassignments. Once this delay is over, assigned partitions will be reassigned to other consumers (if any).

This of course does not happen if you use a new group ID. While it's tempting to use a new group everytime when developing (as you don't have to wait) you lose any committed offsets by the previous group and this might not represent the state your app will be in while running in production.

Regarding max.poll.interval.ms, it's the maximum delay allowed between 2 calls to poll() in your consumer logic. I don't think this setting is relevant to this question.

Ebullition answered 15/6, 2017 at 10:8 Comment(5)

Hm, thanks for the clarifications. If I'm waiting for much longer than 30 seconds (pretty much at least 10 min to rejoin) do you know what other timeout it could be? – Lament 16/6, 2017 at 14:5

In that case, I suggest you turn on debug or tracing in log4j and check what the client is doing – Ebullition 16/6, 2017 at 14:28

Hi, I've updated my original question with logs and more info, if you have any advice/insights that would be very appreciated – Lament 29/6, 2017 at 22:7

Just ran into the same issue. And it looks like if we do not kill the consumers gracefully, it gets stuck in re-joining forever. Looking at the trace logs, it looks like we are getting similar messages. – Ferrick 19/8, 2018 at 5:36

I experienced similar issue, it might have to do with large value for max.poll.interval.ms, similar as in your case, I also was using large value for max.poll.interval.ms. Did you get more insight into the problem ? I also learned that Kafka sets timeout for JoinGroup request based on max.poll.interval.ms kafka.apache.org/0101/documentation.html#upgrade_1010_notable – Blast 30/12, 2018 at 19:49

Recommended topics

Hot tags