Kafka Connect gets into a rebalance loop

I've just deployed my Kafka Connect application (I only use a Connect source for MQTT) on a cluster of two instances (2 containers on 2 machines), and now it seems to get into a sort of rebalancing loop. I got a little bit of data at the beginning, but no new data appears. This is what I get in my log:

[2017-08-11 07:27:35,810] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-592bcc91-9d99-4c54-b707-3f52d0f8af50', leaderUrl='http://10.120.233.78:9040/', offset=2, connectorIds=[SourceConnector1], taskIds=[]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1009)
[2017-08-11 07:27:35,810] WARN Catching up to assignment's config offset. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:679)
[2017-08-11 07:27:35,810] INFO Current config state offset 1 is behind group assignment 2, reading to end of config log (org.apache.kafka.connect.runtime.distributed.DistributedHerder:723)
[2017-08-11 07:27:36,310] INFO Finished reading to end of log and updated config snapshot, new config log offset: 1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:727)
[2017-08-11 07:27:36,310] INFO Current config state offset 1 does not match group assignment 2. Forcing rebalance. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:703)
[2017-08-11 07:27:36,311] INFO Rebalance started (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1030)
[2017-08-11 07:27:36,311] INFO Wasn't unable to resume work after last rebalance, can skip stopping connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1056)
[2017-08-11 07:27:36,311] INFO (Re-)joining group source-connector11234 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:381)
[2017-08-11 07:27:36,315] INFO Successfully joined group source-connector11234 with generation 28 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:349)
[2017-08-11 07:27:36,317] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-592bcc91-9d99-4c54-b707-3f52d0f8af50', leaderUrl='http://10.120.233.78:9040/', offset=2, connectorIds=[SourceConnector1], taskIds=[]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1009)
[2017-08-11 07:27:36,317] WARN Catching up to assignment's config offset. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:679)
[2017-08-11 07:27:36,317] INFO Current config state offset 1 is behind group assignment 2, reading to end of config log (org.apache.kafka.connect.runtime.distributed.DistributedHerder:723)
Hurter answered 11/8, 2017 at 12:54
Make sure you set rest.advertised.host.name and that the two Connect servers can resolve one another via that name. – Lookthrough
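
A minimal sketch of the relevant distributed-worker properties: the group.id and REST port are taken from the question's log, while the broker address and the worker host name connect-worker-1.example.internal are placeholders.

# connect-distributed.properties on worker 1 (placeholder values)
bootstrap.servers=the-url-of-your-kafka:9092
group.id=source-connector11234
# Advertise an address the other worker can resolve and reach; followers
# forward REST requests to the leader via this URL, so an unresolvable
# name can leave the group stuck rebalancing.
rest.advertised.host.name=connect-worker-1.example.internal
rest.advertised.port=9040

Each worker should advertise its own routable address, not localhost or an internal container hostname the other machine cannot resolve.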

I too faced a similar problem, running two separate containers on a Mesos cluster. The eventual solution was an annoying one that is not documented anywhere:

Use an odd number of containers!

Some distributed systems rely on their workers to elect a leader. If there are two, they each vote for the other and get stuck in a loop. This appears to be what's happening here as well.

Purpurin answered 16/7, 2019 at 9:45

I've encountered a similar problem. Adding a second (or third) kafka-connect node to the cluster started an eternal rebalance. In my case the topic connect-offsets had been created with 5 partitions (my Kafka default for auto-created topics) instead of the expected 25 partitions (see https://docs.confluent.io/platform/current/connect/references/allconfigs.html on connect-offsets). To check whether this is your case, run the command below; the output should look something like this:

$ bin/kafka-topics.sh --topic connect-offsets --bootstrap-server the-url-of-your-kafka:9092 --describe
Topic: connect-offsets  PartitionCount: 25      ReplicationFactor: 3    Configs: cleanup.policy=compact,message.format.version=2.7-IV2
Topic: connect-offsets  Partition: 0    Leader: 1       Replicas: 1,0,2 Isr: 0,2,1
... repeated that line 25 times...

Pay special attention to the PartitionCount value.
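
If the count is wrong, one way to fix it is to recreate the offsets topic with the expected partition count. A minimal sketch, assuming a fresh cluster where losing the stored source offsets is acceptable (deleting connect-offsets discards them):

# WARNING: deleting connect-offsets discards all stored source offsets
$ bin/kafka-topics.sh --bootstrap-server the-url-of-your-kafka:9092 --delete --topic connect-offsets
$ bin/kafka-topics.sh --bootstrap-server the-url-of-your-kafka:9092 --create --topic connect-offsets --partitions 25 --replication-factor 3 --config cleanup.policy=compact

Restart the Connect workers afterwards so they pick up the recreated topic.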

You might also take a peek at the connect-configs topic configuration with a similar command, as shown below.
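
Reusing the same placeholder broker address; Kafka Connect expects the config topic to have exactly one partition, so the output should look roughly like this:

$ bin/kafka-topics.sh --topic connect-configs --bootstrap-server the-url-of-your-kafka:9092 --describe
Topic: connect-configs  PartitionCount: 1       ReplicationFactor: 3    Configs: cleanup.policy=compact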

Krupp answered 29/5, 2023 at 22:7
