Is Apache Kafka appropriate for use as an unordered task queue?
Kafka splits incoming messages up into partitions, according to the partition assigned by the producer. Messages from partitions then get consumed by consumers in different consumer groups.

This architecture makes me wary of using Kafka as a work/task queue, because I have to specify the partition at time of production, which indirectly limits which consumers can work on it because a partition is sent to only one consumer in a consumer group. I would rather not specify the partition ahead of time, so that whichever consumer is available to take that task can do so. Is there a way to structure partitions/producers in a Kafka architecture where tasks can be pulled by the next available consumer, without having to split up work ahead of time by choosing a partition when the work is produced?

Using only one partition for this topic would put all the tasks in the same queue, but then the number of consumers is limited to 1 per consumer group, so each consumer would have to be in a different group. Then all of the tasks get distributed to every consumer group, though, which is not the kind of work queue I'm looking for.
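To make the constraint concrete, here is a broker-free sketch (consumer names are illustrative, and the round-robin assignment here merely stands in for Kafka's real assignment strategies) of the rule that each partition is owned by exactly one consumer within a group:

```python
# Simulate the group-assignment rule: within a consumer group, each
# partition goes to exactly one consumer, so a one-partition topic can
# keep at most one consumer busy per group.
def assign_partitions(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(p)
    return assignment

one_partition = assign_partitions([0], ["worker-a", "worker-b"])
print(one_partition)   # {'worker-a': [0], 'worker-b': []} -- worker-b is idle

many_partitions = assign_partitions(list(range(4)), ["worker-a", "worker-b"])
print(many_partitions) # {'worker-a': [0, 2], 'worker-b': [1, 3]}
```

With a single partition, the second consumer in the same group gets nothing at all, which is exactly the limitation described above.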

Is Apache Kafka appropriate for use as a task queue?

Loginov answered 24/3, 2016 at 17:18 Comment(1)
On a side note: your problem can be solved using Apache Pulsar which has a shared topic-consumer subscription. See pulsar.apache.org/docs/latest/getting-started/…Mankind

Using Kafka for a task queue is a bad idea. Use RabbitMQ instead, it does it much better and more elegantly.

Although you can use Kafka for a task queue, you will run into some issues: Kafka does not allow a single partition to be consumed by many consumers (by design), so if, for example, a single partition fills up with many tasks and the consumer that owns that partition is busy, the tasks in that partition will suffer "starvation". This also means that the order in which tasks are consumed from the topic will not match the order in which they were produced, which can cause serious problems if the tasks need to be consumed in a specific order (to fully achieve that in Kafka, you must have only one consumer and one partition, which means serial consumption by just one node; with multiple consumers and multiple partitions, the order of task consumption is not guaranteed at the topic level).
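A toy, broker-free simulation of the starvation problem (all names and the per-tick "budget" are hypothetical): every task lands on one "hot" partition whose owner is slow, while the other consumer sits idle and cannot steal the work.

```python
from collections import deque

# Two partitions, each owned by one consumer in the group; ownership is
# fixed, so an idle consumer cannot take work from a busy one.
partitions = {0: deque(), 1: deque()}
for task in range(10):
    partitions[0].append(f"task-{task}")   # every task lands on partition 0

def drain(partition, budget):
    """Consume up to `budget` tasks from one partition in this 'tick'."""
    done = []
    while partition and len(done) < budget:
        done.append(partition.popleft())
    return done

busy_owner = drain(partitions[0], budget=1)   # slow consumer: 1 task per tick
idle_owner = drain(partitions[1], budget=5)   # fast consumer owns an empty partition

print(busy_owner)          # ['task-0']
print(idle_owner)          # []  -- idle, but cannot help with partition 0
print(len(partitions[0]))  # 9 tasks still waiting
```

In a shared queue (RabbitMQ-style), the fast consumer would simply pull the next task; with partition ownership it cannot.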

In fact, Kafka topics are not queues in the computer-science sense. A queue means first in, first out; that is not what you get from Kafka at the topic level.

Another issue is that it is difficult to change the number of partitions dynamically, while adding and removing workers should be dynamic. If you want to ensure that new workers will get tasks in Kafka, you have to set the partition count to the maximum possible number of workers, which is not elegant.

So the bottom line - use RabbitMQ or other queues instead.

Having said all of that, Samza (by LinkedIn) uses Kafka as a sort of streaming-based task queue: Samza

Edit, scale considerations: I forgot to mention that Kafka is a big-data/big-scale tool. If your job rate is huge, then Kafka might be a good option for you despite the things I wrote earlier, since dealing with huge scale is very challenging and Kafka is very good at that. If we are talking about smaller scales (say, up to a few dozen or a few hundred jobs per second), then Kafka is again a poor choice compared to RabbitMQ.

Collenecollet answered 31/3, 2016 at 6:54 Comment(10)
Might also be worth mentioning that committing offsets quickly gets complex when handling failing tasks that need retrying.Handout
"in Kafka to fully achieve that you must have only one consumer and one partition" is incorrect. Order is guaranteed for each partition in topic based on the partition key. So if order matters, you need to partition by the value on which order matters. This is actually stronger ordering guarantees than rabbitmq, which may only have one consumer to guarantee ordering.Negotiation
but then you can have just one consumer, which is not good enoughCollenecollet
One consumer per partition, not per topic. The issue is in rabbitmq as well. If you want messages to be processed in guaranteed order, then you can only have one consumer for that queue. You cannot process work in order with parallel consumers.Negotiation
in rabbitmq the consumption is in guaranteed order even with multiple consumers. The limitation in rabbit is not on consumption but on the work being done, which you really can't guarantee to be in order. Kafka provides neither. Multiple consumers can't consume in order in Kafka. If one partition is full and the rest are empty you will get starvation in Kafka.Collenecollet
Kafka's main advantage is in streaming huge amounts of data. If you are not streaming huge amounts of data, Kafka is probably a bad choiceCollenecollet
"With release 2.7.0 and later it is still possible for individual consumers to observe messages out of order if the queue has multiple subscribers. This is due to the actions of other subscribers who may requeue messages. From the perspective of the queue the messages are always held in the publication order." rabbitmq.com/semantics.htmlNegotiation
individual consumers are irrelevant for work queues that should support multiple consumers...Collenecollet
Order is not guaranteed in any meaningful way when you have multiple consumers. What if one consumer fails and the task gets requeued? What if consumer A finishes a task before consumer B, even though they received them in the opposite order? Kafka has ironclad ordering guarantees. The vast majority of message queues do not, including RabbitMQ, unless you have a single producer and a single consumer.Hugh
For those working in AWS, SQS is the equivalent of RabbitMQ and would be a better choice for job execution than Kafka or the AWS equivalent of Kafka, which is Kinesis Streams.Carnes

There is a lot of discussion in this thread revolving around the order of execution of tasks in a work or task queue. I would put forth the notion that order of execution should not be a feature of a work queue.

A work queue is a means of controlling resource usage by applying a controllable number of worker threads to the completion of distinct tasks. Enforcing a processing order on tasks in a queue also enforces a completion order, which means tasks would always be processed sequentially, with the next task starting only after the end of the preceding one. That is effectively a single-threaded task queue.

If order of execution is important for some of your tasks, then each such task should add the next task in the sequence to the work queue upon its completion. Either that, or you support a Sequential Job type which, when processed, actually runs a list of jobs sequentially on one worker.
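The "chain the next task on completion" idea can be sketched like this (illustrative, broker-free Python; the task shape is made up for the example):

```python
from collections import deque

queue = deque()

def worker(queue):
    """Pull tasks in arrival order; ordered work re-enqueues its successor."""
    results = []
    while queue:
        task = queue.popleft()
        results.append(task["name"])   # "process" the task
        nxt = task.get("next")         # sequential step: enqueue the follow-up
        if nxt:
            queue.append(nxt)
    return results

# step-2 only enters the queue after step-1 completes, so ordering is
# enforced by the tasks themselves, not by the queue.
queue.append({"name": "step-1", "next": {"name": "step-2"}})
queue.append({"name": "independent"})
result = worker(queue)
print(result)  # ['step-1', 'independent', 'step-2']
```

Unrelated tasks still interleave freely; only the explicit chain is serialized.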

In no way should the work queue actually order any of its work; the next available processor should always take the next task, with no regard to what has occurred before it or what happens after it completes.

I was also looking at Kafka as the basis for a work queue, but the more I research it, the less it looks like the desired platform.

I see it mainly being used as a means of synchronizing disparate resources and not so much as a means of executing disparate job requests.

Another area that I think is important in a work queue is the support of a prioritization of tasks. For example, if I have 20 tasks in the queue, and a new task arrives with a higher priority, I want that task to jump to the start of the line to be picked up by the next available worker. Kafka would not allow this.
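For contrast, the jump-the-line behaviour described here is trivial with an in-memory priority queue, but has no counterpart in a Kafka topic, where consumers only read forward. A minimal sketch (task names and priority values are made up):

```python
import heapq

# (priority, sequence, task): a lower priority number means more urgent;
# the sequence counter breaks ties so equal-priority tasks stay FIFO.
pq, seq = [], 0

def submit(task, priority):
    global seq
    heapq.heappush(pq, (priority, seq, task))
    seq += 1

for i in range(3):
    submit(f"routine-{i}", priority=5)
submit("urgent", priority=0)          # arrives last, but runs first

order = [heapq.heappop(pq)[2] for _ in range(len(pq))]
print(order)  # ['urgent', 'routine-0', 'routine-1', 'routine-2']
```

A Kafka consumer at offset N has no way to pull a higher-priority message from offset N+20 first; the multi-topic scheme in the comment below is a workaround, not a built-in feature.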

Carnes answered 15/5, 2018 at 21:47 Comment(1)
You could do task prioritization with multiple topics, one for each priority level. Each topic is partitioned identically, and you have a coordinator process for each partition. The coordinator processes each have a consumer for each topic and maintain a local priority queue based on the priorities of the topics. The coordinator can then fan back out from there, farming out tasks to whatever pool of resources it controls, whether local or remote.Valenta

I would say that this depends on the scale. How many tasks do you anticipate in a unit of time?

What you describe as your end goal is basically how Kafka works by default. When you produce messages without a key, the default (and most widely used) option is a partitioner that chooses partitions in round-robin fashion, keeping them evenly used (so it is possible to avoid specifying a partition).
The main purpose of partitions is to parallelize processing of messages, so you should use it in such a manner.
The other common thing partitions are used for is ensuring that certain messages are consumed in the same order as they were produced: you specify the partitioning key so that all such messages end up in the same partition. E.g. using userId as the key ensures that all messages for a given user are processed in order.
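The keyed-partitioning idea can be sketched as follows (CRC32 stands in here purely for illustration; the real Java client uses a murmur2 hash, and the event data is made up):

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key):
    """Map a message key to a partition. Every message with the same key
    lands on the same partition, so per-key ordering is preserved."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

events = [("user-1", "login"), ("user-2", "login"),
          ("user-1", "purchase"), ("user-1", "logout")]

placements = [(key, partition_for(key)) for key, _ in events]
# All user-1 events map to a single partition, whichever number that is:
assert len({p for k, p in placements if k == "user-1"}) == 1
print(placements)
```

This is why keyed messages preserve per-key order while still spreading different keys across partitions.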

Octal answered 24/3, 2016 at 20:7 Comment(3)
Thanks for your answer Marko, maybe we can get to the bottom of this with an example. So say we have 20 partitions and 2 workers, and 100 new jobs come in. With round robin, the job messages get distributed 5 to each partition, and then each consumer gets 10 partitions, which is 50 jobs. Say that one consumer's 50 jobs take 100 milliseconds (for all of them combined), but the other consumer's 50 jobs take 2 minutes. Will the consumer that finished early be able to help out the overloaded consumer? Does Kafka make some kind of assumption about equal job difficulties?Loginov
Hey Marko, I think my last question in that comment got to the heart of the issue here, if you can just add some more detail for that, then I'll definitely accept your answer!Loginov
Any of those 100 messages would go to a random partition and would get picked up by one of those two (i.e. random) Consumers, then the second message, then the third, ... so it's not like each Consumer will get a bulk of 50 messages, i.e. they "help each other out". But why would you limit yourself to only 2 Consumer threads? Also, you would commit the offset only after each message is processed, to make sure you don't lose any messages if processing is unsuccessful.Octal

There are two main obstacles in trying to use Kafka as a message queue:

  1. as described in Ofer's answer, you can only consume a single partition from a single consumer, and order of processing is guaranteed only within a partition. So if you can't distribute the tasks fairly across partitions, this might be a problem

  2. by default, you can only acknowledge processing of all messages up to a given point (offset). Unlike in traditional message queues, you can't do selective acknowledgment and, in case of failure, selective retries. This can be addressed by using kmq, which adds individual-ack capability with the help of an additional topic (disclaimer: I'm the author of kmq).
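The difference between the two acknowledgment models can be sketched with a toy in-memory model (this is not the real Kafka or kmq API, just an illustration): a cumulative offset commit cannot skip a failed message, while selective acks can.

```python
# Toy model: five messages, and message 2 fails processing.
messages = [0, 1, 2, 3, 4]
failed = {2}

# Kafka-style cumulative commit: you may only commit a watermark, so the
# commit must stop just before the first failed/unprocessed message --
# everything after it gets redelivered, even if it already succeeded.
def committable_offset(messages, failed):
    offset = 0
    for m in messages:
        if m in failed:
            break
        offset = m + 1
    return offset

# Selective acknowledgment (the capability kmq layers on top): ack exactly
# the messages that succeeded, regardless of their position.
acked = [m for m in messages if m not in failed]

print(committable_offset(messages, failed))  # 2 -> messages 3 and 4 must be redelivered
print(acked)                                 # [0, 1, 3, 4]
```

This is why retrying a single failed task is awkward with plain offset commits: the watermark semantics force reprocessing of everything past the failure.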

RabbitMQ is an alternative of course, but it also offers different (weaker) performance and replication guarantees. In short, the RabbitMQ docs state that the broker is not partition tolerant. See also our comparison of message queues with data replication, mqperf.

Nikolaos answered 27/6, 2017 at 13:54 Comment(0)

I am developing a library that implements a job queue on top of Kafka: https://github.com/JingIsCoding/kafka-job-queue. It uses multiple queues to maintain tasks that are ready to be processed, future tasks, and dead tasks. Contributions are welcome

Alida answered 19/6, 2021 at 21:9 Comment(0)
