Why is Kafka pull-based instead of push-based?

Why is Kafka pull-based instead of push-based? I agree that Kafka gives high throughput, as I have experienced it, but I don't see how Kafka's throughput would go down if it were push-based. Any ideas on how a push-based design could degrade performance?

Kenley answered 20/9, 2016 at 5:53 Comment(0)

Scalability was the major driving factor when designing such systems (pull vs. push). Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add a large number of consumers without affecting performance and without downtime.

Kafka can handle events coming from producers at rates of 100k+ per second. Because Kafka consumers pull data from the topic, different consumers can consume messages at different paces. Kafka also supports different consumption models: you can have one consumer processing messages in real time and another consumer processing messages in batch mode.

Another reason is that Kafka was not designed only for a single kind of consumer, such as Hadoop. Different consumers can have diverse needs and capabilities.

Pull-based systems have some deficiencies, such as wasting resources on frequent polling when no data is available. Kafka alleviates this with a 'long polling' mode in which a fetch request waits on the broker until real data arrives.
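
As a rough illustration of that long-polling mode (a minimal sketch, not part of the original answer; the broker address, group id, topic name, and property values are hypothetical), the standard Java consumer blocks in poll() while the broker holds the fetch request until data or a timeout arrives:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LongPollConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "demo-group");              // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Long-poll knobs: the broker holds the fetch until at least 1 KB is
        // available or 500 ms have elapsed, whichever comes first.
        props.put("fetch.min.bytes", "1024");
        props.put("fetch.max.wait.ms", "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));        // hypothetical topic
            while (true) {
                // poll() blocks up to 1 s client-side, so an idle topic does not
                // turn into a tight request loop.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```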

Auxiliary answered 20/9, 2016 at 6:12 Comment(3)
It'd be interesting to know about the advantages of push-based though.Gustative
Kafka also supports a pub-sub model, which in my opinion is not pull-based.Cons
@YugSingh The OP is referring specifically to the fact that topic consumers are responsible for fetching (pulling) messages; instead of the broker being responsible for pushing them to connected consumers.Levitus

Refer to the Kafka documentation, which details this particular design decision: Push vs. pull

The major points in favor of pull were:

  1. Pull is better at dealing with diverse consumers (the broker does not have to pick one data transfer rate for all of them);
  2. Consumers can more effectively control the rate of their individual consumption (see the sketch below);
  3. Batching is easier and more efficient to implement.

The drawback of a pull-based system (consumers polling for data while there's no data available for them) is alleviated somewhat by a 'long poll' mode in which the broker holds the request until data arrives.
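
To illustrate points 2 and 3 above, here is a minimal sketch using the standard Java client (my own example, not from the linked documentation); the broker address, group id, topic name, and the backlogTooLarge/process helpers are hypothetical, and the property values are arbitrary. The consumer caps how many records each poll may return and pauses its own partitions when it is falling behind, with no broker-side coordination:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PacedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker address
        props.put("group.id", "batch-group");                // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "500");                // cap the batch size returned by each poll

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));           // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                if (backlogTooLarge()) {
                    // Back-pressure is purely consumer-side: stop asking for data
                    // without leaving the consumer group.
                    consumer.pause(consumer.assignment());
                } else {
                    consumer.resume(consumer.paused());
                }
                process(batch); // per-record or whole-batch processing, the consumer's choice
            }
        }
    }

    private static boolean backlogTooLarge() { return false; }          // hypothetical placeholder
    private static void process(ConsumerRecords<String, String> b) { }  // hypothetical placeholder
}
```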

Kenyettakenyon answered 20/9, 2016 at 6:1 Comment(5)
You have it exactly backwards. From the docs: "However, a push-based system has difficulty dealing with diverse consumers..."Starwort
Oops, that was a typo; I should have been more careful. Corrected it, thanks a lot for notifying.Kenyettakenyon
Another drawback to pull-based systems is that latency will be greater, due to the pause between polling requests where data shows up and is waiting for a pull.Secretory
Just curious. Aren't points 1 and 2 the same? That is, the advantage of pull is that different consumers have different rates at which they might want to consume? Not a native speaker, so I am just confused about why points 1 and 2 are separate instead of being one point/sentence.Bombay
Upvote for including the official linkSuzerainty

Others have provided answers based on Kafka's documentation, but product documentation should sometimes be taken with a grain of salt as an absolute technical reference. For example:

  • Numerous push-based messaging systems support consumption at different rates, usually through their session-management primitives. You establish/resume an active application-layer session when you want to consume and suspend the session (e.g. by simply not responding for less than the keepalive window and greater than the in-flight windows... or with an explicit message) when you want to stop/pause. MQTT and AMQP, for example, both provide this capability (in MQTT's case, since the late 90s). Given that no actions are required to pause consumption (by definition), and less traffic is required in the steady state (no requests), it is difficult to see how Kafka's pull-based model is more efficient.
  • One critical advantage push messaging has vs. pull messaging is that there is no request traffic to scale as the number of potentially active topics increases. If you have a million potentially active topics, you have to issue queries for all those topics. This concern becomes especially relevant at scale.
  • The critical advantage pull messaging has vs. push messaging is replayability. This factors a great deal into whether downstream systems can offer guarantees around processing (e.g. they might fail before doing so and have to restart, or fail to write messages recoverably); see the sketch after this list.
  • Another critical advantage of pull messaging vs. push messaging is buffer allocation. A consuming process can explicitly request as much data as it can accommodate in a pre-allocated buffer, rather than having to allocate buffers over and over again. This wins back some of the goodput lost to query scaling vs. push messaging (but not much). The impact is measurable, however, if your message sizes vary wildly (e.g. a few KB to a few hundred MB).
  • It is a fallacy to suggest that pull messaging has structural scalability advantages over push messaging. Partitioning is what usually provides scale in messaging applications, regardless of the consumption model. There are push messaging systems operating well in excess of 300M msgs/sec on hard-wired local clusters; 125K msgs/sec doesn't even buy admission to the show. In fact, pull messaging has inferior goodput by definition, and systems like Kafka usually end up with more hardware to reach the same performance level. The benefits noted above may often make it worth the cost. I am unaware of anyone using Kafka for messaging in high-frequency trading, for example, where microseconds matter.
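
As a sketch of the replayability and buffer-allocation points above (my own illustration, not the answerer's; the broker address, topic, partition, offset, and fetch size are made up), the Java client lets a consumer rewind to an arbitrary offset and bound how much data a single fetch may return:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // hypothetical broker address
        props.put("group.id", "replay-group");                  // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Bound how much one fetch may return per partition, so the client can
        // work with a pre-sized receive buffer instead of allocating per message.
        props.put("max.partition.fetch.bytes", String.valueOf(4 * 1024 * 1024));

        TopicPartition tp = new TopicPartition("events", 0);     // hypothetical topic/partition
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(tp));
            consumer.seek(tp, 12_345L);                          // replay from a known offset
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            System.out.println("replayed " + records.count() + " records");
        }
    }
}
```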

It may be interesting to note that various push-pull messaging systems were developed in the late 1990s as a way to optimize goodput. The results were never staggering, and the system complexity and other factors often outweighed this kind of optimization. I believe this is Jay's point overall about practical performance over real data center networks, not to mention the open Internet.

Sixty answered 4/3, 2019 at 22:39 Comment(16)
"One critical advantage push messaging has vs. pull messaging is that there is no request traffic to scale as the number of potentially active topics increases. If you have a million potentially active topics, you have to issue queries for all those topics. This concern becomes especially relevant at scale." -- If you have a large number of the consumer the Push model has to keep track and manage all the consumer info. Also what if consumers cannot consume at the rate broker is pushing? So both of these problems exist at scale in Push model as well.Kenley
@Kenley No, that's not exactly true. What I believe you are thinking of as "routing table metadata" can be embedded directly into a dissemination hierarchy. See for example: semanticscholar.org/paper/… You cannot compare routing table maintenance to request overhead. The only shortcut request overhead has is batch requests. Routing has numerous shortcuts (for example: making client session establishment atomic, paying the cost up front for the routing table changes).Sixty
@Kenley You will additionally note that I addressed the issue of variable consumption rates quite clearly in the original post :). It is a non-issue in both models.Sixty
"routing table metadata" can be embedded directly into a dissemination hierarchy". so are you talking about something like Kademlia where a broker multicasts to a group of consumers instead of all and then each member in the group doing the same? Then don't you need to put all this logic into consumer?Kenley
"There are push messaging systems operating well in excess of 300M msgs/sec on hard wired local clusters" can you name some?Kenley
@Kenley (assuming some distributed setting with a cluster of brokers functioning as a single logical system...). If I have multiple subscribers to a topic and the publisher resides on a different broker node, that broker node has to know where to forward the message to in a push model. That is the "routing table metadata" I am speaking of...Brokers have to know what topics have subscribers on other brokers if they have a publisher for those topics.Sixty
yeah, so imagine there are 1M subscribers on a topic. Now you want this broker to push messages to all 1M subscribers, and are you claiming that is scalable as well?Kenley
@Kenley "Can you name some" Sure. IBM MessageSight appliances did 13M messages/sec with microsecond latency. Each. In 2013. ibm.com/developerworks/community/blogs/mobileblog/entry/…Sixty
@Kenley Far more scalable than pull requests, yep. Dissemination hierarchies are usually trees, hence you control the fan out which trades off outright latency for throughput. It is vastly more scalable. The best you can hope for with something like Kafka is to spread the reads evenly across the replicas.Sixty
Dissemination hierarchies among subscribers? If so, then I interpreted it correctly as before, and therefore I have the same question as before: "routing table metadata can be embedded directly into a dissemination hierarchy" -- so are you talking about something like Kademlia, where a broker multicasts to a group of consumers instead of all of them, and then each consumer in the group does the same? Then don't you need to put all this logic into the consumer?Kenley
@Kenley I think the confusion here is caused by the use of "consumer". Say I have 1000 brokers in a cluster. I have a publisher on broker 0. I have a million subscribers attached randomly to brokers. The naive way would be for broker 0 to fan out directly to all 999 other brokers. To get higher throughput, I could forward these along Kad's finger table (to use your Kad example, but Brisa is designed for this). Say I use a tree with a fanout of 8, rooted at the publisher's broker. The root broker sends 8 msgs, and the child brokers propagate the message to their children and subscribers.Sixty
consumer = subscriber; "consumer" is just the more common term in the Kafka world.Kenley
What is Brisa? A quick google search didn't show me anything relevantKenley
@Kenley Alternately, look at it like this. With pull, all the "routing metadata" appears to be in the request, but this is not true unless the consumer happens to hit a replica. If they hit a non-replica and the request is routed internally, it's simply the same replica:n fan out (in reverse). A dissemination hierarchy lets you control this. It's explicitly the trade off of latency for throughput. There is no free lunch in either solution. Maintaining a dissemination hierarchy under churn is quite complicated.Sixty
@Kenley I linked to the Brisa paper in my comments above..again: semanticscholar.org/paper/… You can also check out something similar which uses random gossip for dissemination: csl.mtu.edu/cs6461/www/Reading/Birman99.pdf Or newscasting: cs.unibo.it/bison/publications/ap2pc03.pdfSixty
As a Kafka professional of 7 years now, I can say this is the best answer here.Illogicality

Kafka uses a pull-based system that lets consumers request messages; pushing would just be extra work for the broker. With Kafka, the responsibility of fetching messages is on the consumers, and consumers can decide at what rate they want to process messages.

If the broker pushed messages and some of the consumers were down, the broker would have to retry the push a certain number of times before deciding to give up, which hurts performance. Imagine the workload of pushing messages to a large number of consumers. A push-based approach is better suited to low-latency messaging.
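
For contrast, here is a minimal sketch with the standard Java client (my own illustration; the broker address, group id, and topic 'orders' are hypothetical): a pull consumer that goes down simply resumes from its last committed offset when it restarts, so the broker never needs to retry a push.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResumingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        props.put("group.id", "orders-service");           // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");          // commit only after successful processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));          // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // ... process records here ...
                // Committing records the consumer's own progress; if this process dies,
                // it restarts from the last committed offset, with no broker-side retries.
                consumer.commitSync();
            }
        }
    }
}
```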

Crayon answered 5/10, 2022 at 0:1 Comment(0)
