Kafka: how to know when related messages have been consumed

Is there any way, in Kafka, to produce a message once several related messages have been consumed? (Without having to control it manually in the application code...)

The use case would be to take a huge file, split it into several chunks, publish a message for each chunk to a topic, and, once all these messages are consumed, produce another message on another topic announcing the result.

We can do it with a database, or Redis, to track the state, but I wonder if there's a higher-level approach leveraging only the Kafka ecosystem.

Mishap answered 11/9, 2020 at 17:52 Comment(2)
What does the consumer in "and once all these messages are consumed" look like? Is it also a Kafka Streams application, or something else?Straggle
Originally it would be a Spring Boot Kotlin consumer application, but we would be open to options...Mishap

You can use ConsumerGroupCommand to check whether a certain consumer group has finished processing all messages in a particular topic:

  $ kafka-consumer-groups --bootstrap-server broker_host:port --describe --group chunk_consumer

OR

  $ kafka-run-class kafka.admin.ConsumerGroupCommand ...

Zero lag for every partition indicates that all messages have been consumed successfully and their offsets committed by the consumer.

Alternatively, you can choose to subscribe to the __consumer_offsets topic and process messages from it yourself, but using ConsumerGroupCommand seems like a more straightforward solution.
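
If you'd rather run the same check programmatically (say, from the producing application before it publishes the notification), here is a minimal Kotlin sketch using Kafka's AdminClient. The function name and hard-coded values are illustrative only, and it needs kafka-clients 2.5+ for listOffsets:

```kotlin
import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.admin.AdminClientConfig
import org.apache.kafka.clients.admin.OffsetSpec

// Returns true when the given consumer group has zero lag on every
// partition it has committed offsets for.
fun groupHasZeroLag(bootstrapServers: String, groupId: String): Boolean {
    val props = Properties().apply {
        put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    }
    AdminClient.create(props).use { admin ->
        // Offsets the group has committed, per partition
        val committed = admin.listConsumerGroupOffsets(groupId)
            .partitionsToOffsetAndMetadata().get()
        if (committed.isEmpty()) return false // group has consumed nothing yet

        // Current log-end offsets for the same partitions
        val logEnd = admin.listOffsets(committed.keys.associateWith { OffsetSpec.latest() })
            .all().get()

        // Zero lag everywhere means every published message was consumed
        return committed.all { (tp, meta) -> meta.offset() >= logEnd.getValue(tp).offset() }
    }
}
```

The producer side could poll this until it returns true and then publish the result message to the other topic.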

Negrete answered 11/9, 2020 at 19:17 Comment(2)
As far as I understand, consumer groups would be tied to a specific application rather than created dynamically for each of the files. I suspect the other answer, using Kafka Streams, makes more sense for this particular use case.Mishap
Not sure I follow - a commit of an offset by a consumer IS a confirmation that the message was consumed successfully. So if, on the producer side, you monitor offsets and ensure that all of them are committed, you know that all your "chunks" were consumed. Once that happens, you can publish a confirmation or do whatever else you need to do.Negrete

The approach could be as follows:

  1. After consuming each chunk, the application should produce a status message (consumed, plus the chunk number).
  2. A second application (a Kafka Streams one) should aggregate the results and, once it has seen messages for all the chunks, produce a final message saying the file is processed (see the sketch after this list).
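
Here is a rough Kafka Streams sketch of step 2, assuming chunk-status messages keyed by a file id with a "chunkNumber/numberOfChunks" value; the topic names, application id, and value encoding are all made-up placeholders:

```kotlin
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.Grouped
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.state.KeyValueStore

fun main() {
    val builder = StreamsBuilder()

    builder.stream("chunk-status", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        // Collect every status seen so far into one ';'-joined string per file
        .aggregate(
            { "" },
            { _, value, agg -> if (agg.isEmpty()) value else "$agg;$value" },
            Materialized.with<String, String, KeyValueStore<Bytes, ByteArray>>(
                Serdes.String(), Serdes.String()
            )
        )
        .toStream()
        // Forward the update only once every distinct chunk number is present
        .filter { _, agg ->
            val statuses = agg.split(";")
            val total = statuses.first().substringAfter("/").toInt()
            statuses.map { it.substringBefore("/").toInt() }.toSet().size == total
        }
        .mapValues { _ -> "FILE_PROCESSED" }
        .to("file-completed", Produced.with(Serdes.String(), Serdes.String()))

    val props = Properties().apply {
        put(StreamsConfig.APPLICATION_ID_CONFIG, "chunk-aggregator")    // placeholder
        put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker_host:9092") // placeholder
    }
    KafkaStreams(builder.build(), props).start()
}
```

Note that the filter passes on every update once the set is complete, so a duplicate chunk status would re-emit the completion message; enabling exactly-once processing or clearing the aggregate after emission would tighten this up.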
Paleozoology answered 14/9, 2020 at 9:9 Comment(2)
It does make sense to me and sounds promising, but how would we, in the Kafka Streams one, know that all chunks were processed (excuse my ignorance, never really used Streams)? Do you have any documentation or snippet pointing to that?Mishap
For instance, the message with the chunk status can be as follows: `(key: fileUniqueName, value: chunkNumber, numberOfChunks)`. In the Kafka Streams application you can use the Processor API (kafka.apache.org/10/documentation/streams/developer-guide/…) and aggregate in a custom way - using a state store you can keep track of the number of processed chunks.Worldling
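
To make that comment concrete, here is a rough sketch of the Processor API approach it describes, assuming the `(key: fileUniqueName, value: "chunkNumber,numberOfChunks")` message shape; all topic, store, and node names are placeholders, and it assumes each chunk status arrives exactly once:

```kotlin
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.processor.Processor
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.processor.ProcessorSupplier
import org.apache.kafka.streams.state.KeyValueStore
import org.apache.kafka.streams.state.Stores

// Counts chunk statuses per file in a state store and forwards a single
// completion record once the count reaches numberOfChunks.
class ChunkCountProcessor : Processor<String, String> {
    private lateinit var context: ProcessorContext
    private lateinit var store: KeyValueStore<String, Int>

    override fun init(context: ProcessorContext) {
        this.context = context
        @Suppress("UNCHECKED_CAST")
        store = context.getStateStore("chunk-counts") as KeyValueStore<String, Int>
    }

    override fun process(fileId: String, value: String) {
        val total = value.substringAfter(",").trim().toInt() // "chunkNumber, numberOfChunks"
        val seen = (store.get(fileId) ?: 0) + 1              // assumes no duplicate deliveries
        if (seen >= total) {
            context.forward(fileId, "FILE_PROCESSED")
            store.delete(fileId)                             // clean up finished files
        } else {
            store.put(fileId, seen)
        }
    }

    override fun close() {}
}

val topology: Topology = Topology()
    .addSource("Chunks", Serdes.String().deserializer(), Serdes.String().deserializer(), "chunk-status")
    .addProcessor("Aggregate", ProcessorSupplier { ChunkCountProcessor() }, "Chunks")
    .addStateStore(
        Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("chunk-counts"),
            Serdes.String(), Serdes.Integer()
        ),
        "Aggregate"
    )
    .addSink("Completed", "file-completed",
        Serdes.String().serializer(), Serdes.String().serializer(), "Aggregate")
```

Because the state store is changelog-backed, the per-file counts survive restarts and rebalances, which is exactly what the database/Redis approach in the question was providing.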
