Understanding Kafka Topics and Partitions

I am starting to learn Kafka. During my readings, some questions came to my mind:

  1. When a producer is producing a message, it will specify the topic it wants to send the message to. Is that right? Does it care about partitions?

  2. When a subscriber is running, does it specify its group id so that it can be part of a cluster of consumers of the same topic or several topics that this group of consumers is interested in?

  3. Does each consumer group have a corresponding partition on the broker or does each consumer have one?

  4. Are the partitions created by the broker, and therefore not a concern for the consumers?

  5. Since this is a queue with an offset for each partition, is it the responsibility of the consumer to specify which messages it wants to read? Does it need to save its state?

  6. What happens when a message is deleted from the queue? - For example, the retention was for 3 hours, then the time passes, how is the offset being handled on both sides?

Mickel answered 25/6, 2016 at 2:58 Comment(0)

This post already has answers, but I am adding my view with a few pictures from Kafka: The Definitive Guide.

Before answering the questions, let's look at an overview of producer components:

Overview of producer components


  1. When a producer is producing a message, it will specify the topic it wants to send the message to. Is that right? Does it care about partitions?

The producer decides the target partition for each message as follows (a short sketch follows the list):

  • Partition id, if it is specified in the record
  • hash(key) % number of partitions, if a key is present but no partition id is given
  • Round robin, if neither a partition id nor a key is present (i.e., only the value is set)
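As a hedged sketch (the topic name "orders", the broker address, and the keys/values are made up), the three cases look like this with the Java producer:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionChoiceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Explicit partition id: the record goes to partition 0; the key is stored but not used for routing.
            producer.send(new ProducerRecord<>("orders", 0, "order-42", "payload"));

            // 2. Key only: the default partitioner hashes the key, so records with the
            //    same key always land in the same partition.
            producer.send(new ProducerRecord<>("orders", "order-42", "payload"));

            // 3. Neither partition nor key: records are spread across partitions
            //    (round robin before 2.4, sticky partitioning from 2.4 on).
            producer.send(new ProducerRecord<>("orders", "payload"));
        }
    }
}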

  2. When a subscriber is running - Does it specify its group id so that it can be part of a cluster of consumers of the same topic or several topics that this group of consumers is interested in?

You should always configure group.id unless you are using the simple assignment API (assign() rather than subscribe()) and you don't need to store offsets in Kafka; in that case the consumer is not part of any group. Source.
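A minimal sketch of that configuration (the group name, topic, and broker address are made up here):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupIdSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "billing-service");           // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() makes this consumer a member of the "billing-service" group;
            // Kafka balances the topic's partitions across all members of that group.
            consumer.subscribe(Collections.singletonList("orders"));
            consumer.poll(Duration.ofSeconds(1));
        }
        // With the simple assignment API (consumer.assign(...)) you pick partitions yourself,
        // and group.id is only needed if you still want Kafka to store your offsets.
    }
}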


  3. Does each consumer group have a corresponding partition on the broker or does each consumer have one?

In one consumer group, each partition will be processed by exactly one consumer. These are the possible scenarios:

  • If the number of consumers is lower than the number of topic partitions, then some consumers in the group are assigned multiple partitions:

    Number of consumers less than topic partitions

  • If the number of consumers is the same as the number of topic partitions, then the partition-to-consumer mapping is one to one, as below:

    Number of consumers same as the number of topic partitions

  • If the number of consumers is higher than the number of topic partitions, then the extra consumers stay idle, as seen below (not effective; check Consumer 5):

    Number of consumers more than number of topic partitions


  4. Are the partitions created by the broker, and therefore not a concern for the consumers?

Consumers should be aware of the number of partitions, as discussed in question 3.


  5. Since this is a queue with an offset for each partition, is it the consumer's responsibility to specify which messages it wants to read? Does it need to save its state?

Kafka (to be specific, the Group Coordinator) takes care of the offset state by producing a message to an internal __consumer_offsets topic. This behavior can be made manual by setting enable.auto.commit to false; in that case consumer.commitSync() and consumer.commitAsync() can be used to manage offsets.
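As a hedged sketch of the manual variant (topic, group name, and processing logic are placeholders), committing only after the records from a poll have been processed:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "billing-service");           // hypothetical consumer group
        props.put("enable.auto.commit", "false");           // take over offset management
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // placeholder for real processing
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Commit the offsets returned by the last poll() only after processing succeeded.
                consumer.commitSync();
            }
        }
    }
}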

More about Group Coordinator:

  1. It is one of the brokers in the cluster, elected on the Kafka server side.
  2. Consumers interact with the Group Coordinator for offset commits and fetch requests.
  3. Consumers send periodic heartbeats to the Group Coordinator.

  6. What happens when a message is deleted from the queue? - For example, the retention was for 3 hours, then the time passes, how is the offset being handled on both sides?

If a consumer starts after the retention period, messages are consumed according to the auto.offset.reset configuration, which can be latest or earliest. Technically the effect is latest (start processing new messages), because all the older messages have expired by then; retention is a topic-level configuration.
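As a tiny sketch, this is just one more entry in the consumer's Properties (the choice of "earliest" here is illustrative):

props.put("auto.offset.reset", "earliest");   // or "latest" (the default): where to start when no valid committed offset exists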

Densify answered 13/8, 2018 at 19:16 Comment(12)
Hi! I'm the author of the accepted answer, but I think yours is really nice too, most notably on point number 3 where the diagrams make things 200% clearer! Do you think we should merge? – Libration
I meant that I (or you) could incorporate elements of your answer in mine, to get them more visibility and improve this (currently) top answer. But I wouldn't do it without your agreement! – Libration
Why cannot map multi consumer to a partition? To ensure message just process for once? Thx for your answer. – Kathykathye
@g10guang: It's because of the difficulty of maintaining commit offsets. – Densify
Another scenario. You can have ONE partition and MULTIPLE consumers subscribed/assigned to it. The broker will deliver records to the first registered consumer only. But let's suppose the first consumer takes more time to process the task than the poll interval. The record consumption is not committed to the broker. The broker understands that the consumer hung out. In this state, the broker triggers a rebalancing, sending the new assigned partitions to all its consumers. The message is consumed again by another consumer even though it is still being processed by C1. Be careful. – Hereford
"In one consumer group, each partition will be processed by one consumer only." - @mrsrinivas: Regarding this statement, I think if a Consumer-Group has only 1 Consumer and the CG is connected to a Kafka Topic of multiple partitions, then it should be possible for the single consumer to connect to multiple Partitions. – Interaction
@Deb: it's true, one consumer can read multiple partitions. I intended to say that, from the partition side, one partition will be processed by only one consumer in a CG. Please let me know if there is a better way to explain it. – Densify
got it @Densify – Interaction
How does pub/sub work in this case? If only a single consumer can be attached to a specific partition, how do other consumers receive the same msg? – Fayola
Let's say that I am providing a key on a message set, so each message I am providing a key on lands inside the range of a single partition (say the first 100,000 messages I make all land in partition 1). Then all my other partitions are useless. The next 100,000 would land in partition 2, etc. Is there not a partition strategy that allows a more efficient distribution of my messages? Like a round robin, with the exception that if a key is already assigned to a partition, that partition has priority? – Sizing
Thanks for the question @MortenBork. From version >= 2.4 onwards, setting RoundRobinPartitioner as the producer partitioning strategy should do the same; for version < 2.4, a custom partitioner implementation is the way to achieve it. JIRA: issues.apache.org/jira/browse/KAFKA-3333 – Densify
Plagiarised in this 2022-12-11 blog post (a mashup using several sources). – Guizot

Let's take those in order :)

1 - When a producer is producing a message - It will specify the topic it wants to send the message to, is that right? Does it care about partitions?

By default, the producer doesn't care about partitioning. You have the option to use a customized partitioner for better control, but it's totally optional.
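As a rough sketch of what such a customized partitioner could look like (the routing rule, pinning keys that start with "audit-" to partition 0, is purely illustrative):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class AuditAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key != null && key.toString().startsWith("audit-")) {
            return 0;   // pin "audit" records to one partition (illustrative rule)
        }
        // Otherwise hash the key bytes across all partitions, like the default partitioner does.
        return keyBytes == null
                ? 0
                : Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

You would register it on the producer with props.put("partitioner.class", AuditAwarePartitioner.class.getName()); the rest of the send path stays the same.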


2 - When a subscriber is running - Does it specify its group id so that it can be part of a cluster of consumers of the same topic or several topics that this group of consumers is interested in?

Yes, consumers join (or create if they're alone) a consumer group to share load. No two consumers in the same group will ever receive the same message.


3 - Does each consumer group have a corresponding partition on the broker or does each consumer have one?

Neither. All consumers in a consumer group are assigned a set of partitions, under two conditions: no two consumers in the same group have any partition in common, and the consumer group as a whole is assigned every existing partition.


4 - Are the partitions created by the broker, therefore not a concern for the consumers?

They're not a concern as such, but you can see from 3 that it's totally useless to have more consumers than existing partitions, so the partition count is your maximum parallelism level for consuming.


5 - Since this is a queue with an offset for each partition, is it the responsibility of the consumer to specify which messages it wants to read? Does it need to save its state?

Yes, consumers save an offset per topic per partition. This is totally handled by Kafka, no worries about it.


6 - What happens when a message is deleted from the queue? - For example: The retention was for 3 hours, then the time passes, how is the offset being handled on both sides?

If a consumer ever requests an offset not available for a partition on the brokers (for example, due to deletion), it enters an error mode, ultimately resets itself for this partition to either the most recent or the oldest message available (depending on the auto.offset.reset configuration value), and continues working.

Libration answered 25/6, 2016 at 19:37 Comment(2)
Sry :) It's a bit hard explaining the whole Kafka process in 500-char boxes. I suggest reading kafka.apache.org/documentation.html#theconsumer (and probably the rest of section 4, about Kafka internals). Basically: the consumers request saving offsets, but those are saved elsewhere. – Libration
I just read this and it still doesn't explain where it is held: "Kafka handles this differently. Our topic is divided into a set of totally ordered partitions, each of which is consumed by one consumer at any given time. This means that the position of a consumer in each partition is just a single integer, the offset of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed. This makes the equivalent of message acknowledgements very cheap." – Mickel

As explained in the article What is Apache Kafka?:

Kafka uses the concept of Topics to bring order into the message flow.

To balance the load, a topic may be divided into multiple partitions and replicated across brokers.

Partitions are ordered, immutable sequences of messages that are continually appended to, i.e., a commit log.

Messages in the partition have a sequential id number that uniquely identifies each message within the partition.

Partitions allow a topic’s log to scale beyond a size that will fit on a single server (a broker) and act as the unit of parallelism.

The partitions of a topic are distributed over the brokers in the Kafka cluster where each broker handles data and requests for a share of the partitions.

Each partition is replicated across a configurable number of brokers to ensure fault tolerance.
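For illustration, a topic like that can be created from Java with the AdminClient; a hedged sketch, assuming a local broker and an example topic named "orders" with 6 partitions, each replicated to 3 brokers:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 6 partitions spread over the brokers,
            // each partition replicated to 3 brokers for fault tolerance.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}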

Alternative answered 7/2, 2018 at 14:25 Comment(5)
Is Partition just for topic load balance? – Kathykathye
@g10guang: partitions help in processing messages in parallel as well. – Densify
Please correct me if I am wrong: when a message is sent by a producer and arrives at the topic, is it copied to the partitions as per the configuration, and then a consumer consumes it? – Pathetic
@Pathetic the message will get appended to one of the partitions for that Topic according to the current Partitioner configuration (by default, the hash of the message key determines which partition the message goes to), and yes, a Consumer will pick up the message as it consumes messages from that partition. – Tedie
Didn't get this phrase: "Partitions allow a topic's log to scale beyond a size that will fit on a single server (a broker) and act as the unit of parallelism" – Theocritus
  1. When a producer is producing a message - it will specify the topic it wants to send the message to, is that right? Does it care about partitions?

Yes, the Producer does specify the topic

producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key1, value1), callback);

The more partitions there are in a Kafka cluster, the higher the throughput one can achieve. A rough formula for picking the number of partitions is based on throughput: measure the throughput you can achieve on a single partition for production (call it p) and consumption (call it c); if your target throughput is t, you need at least max(t/p, t/c) partitions. For example, with p = 10 MB/s, c = 20 MB/s, and a target of t = 100 MB/s, that is max(100/10, 100/20) = 10 partitions.


  2. When a subscriber is running - does it specify its group id so that it can be part of a cluster of consumers of the same topic or several topics that this group of consumers is interested in?

When the Kafka consumer is constructed and group.id does not exist yet (i.e. there are no existing consumers that are part of the group), the consumer group will be created automatically. If all consumers in a group leave the group, the group is automatically destroyed.


  3. Does each consumer group have a corresponding partition on the broker or does each consumer have one?

Within a consumer group, each partition is assigned to exactly one consumer. Multiple consumer groups can read the same partition independently, but no two consumers belonging to the same group are assigned the same partition: a consumer processes the messages of its partition sequentially, and if several consumers from one group consumed the same partition, that ordering would be lost. Groups, being logically independent, can consume from the same partition.


  4. Are the partitions created by the broker, and therefore not a concern for the consumers?

Brokers already have partitions. As a rule of thumb, each broker can host up to about 4,000 partitions and each cluster up to about 200,000 partitions.

Whenever a consumer enters or leaves a consumer group, the brokers rebalance the partitions across consumers, meaning Kafka handles load balancing with respect to the number of partitions per application instance for you.

Before assigning partitions to a consumer, Kafka first checks whether there are any existing consumers with the given group-id. If there are no existing consumers with that group-id, it assigns all the partitions of the topic to the new consumer. If there are already two consumers with that group-id and a third consumer wants to consume with the same group-id, it assigns the partitions equally among all three consumers. No two consumers of the same group-id are assigned to the same partition.

Source
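To watch that rebalancing happen, here is a hedged sketch (group, topic, and broker address are made up) that simply logs the partitions this instance gains or loses:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLoggingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "inventory-service");          // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The listener is invoked every time the group coordinator reassigns
            // partitions, e.g. when a consumer joins or leaves the group.
            consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    System.out.println("Giving up: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Now responsible for: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1));
            }
        }
    }
}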


  5. Since this is a queue with an offset for each partition, is it the responsibility of the consumer to specify which messages it wants to read? Does it need to save its state?

Offsets are handled internally by Kafka. The current offset is a pointer to the last record that Kafka has already sent to a consumer in the most recent poll, so the consumer doesn't receive the same record twice. The consumer does not normally need to specify offsets explicitly.
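A small sketch of that "current offset" (topic, partition, and broker address are placeholders): position() returns the offset of the next record that poll() will hand out.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CurrentOffsetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "demo-group");                 // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("orders", 0);   // placeholder topic/partition
            consumer.assign(Collections.singleton(tp));
            consumer.poll(Duration.ofMillis(500));
            // The position advances as records are returned by poll(),
            // which is why the same record is not delivered twice.
            System.out.println("Next offset to read: " + consumer.position(tp));
        }
    }
}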


  6. What happens when a message is deleted from the queue? - For example, the retention was for 3 hours, then the time passes, how is the offset being handled on both sides?

The consumer adjusts automatically: if it requests an offset that has already been deleted, the broker returns an offset-out-of-range error and the consumer resets its position according to auto.offset.reset.

Ziska answered 2/6, 2021 at 13:27 Comment(0)

When a producer is producing a message, it will specify the topic it wants to send the message to. Is that right? Does it care about partitions?

There has been a change to the round-robin strategy mentioned in the other answers.

In versions of Apache Kafka prior to 2.4, the partitioning strategy for messages without keys involved cycling through the partitions of the topic and sending a record to each one. However, this approach had drawbacks in terms of batching efficiency and potential latency issues.

One specific concern was the increased latency experienced with small batches of records when using the original partitioning strategy. This was due to the overhead of cycling through partitions for each individual record.

To address this issue, Apache Kafka version 2.4 introduced a new partitioning strategy called "sticky partitioning". This strategy aims to assign records to partitions in a more efficient manner, reducing latency.

With sticky partitioning, records with null keys are sent to the same partition until the current batch is full or sent, and only then does the producer "stick" to a new partition, rather than cycling through all partitions for every record.

This is dependent on linger.ms and batch.size.

Even when linger.ms is 0, the producer will group records into batches when they are produced to the same partition around the same time.
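A hedged producer sketch of those two settings (broker address, topic name, and the 5 ms linger value are illustrative; 16384 bytes is the documented default batch size):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StickyBatchingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "16384");   // maximum batch size in bytes (the default)
        props.put("linger.ms", "5");        // illustrative: wait up to 5 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with a null key: from Kafka 2.4 the default partitioner keeps
            // sending them to one partition until a batch is full or sent, then
            // "sticks" to another partition, instead of strict round robin.
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("metrics", "event-" + i));
            }
        }
    }
}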

For more details check this link: https://www.confluent.io/blog/apache-kafka-producer-improvements-sticky-partitioner/

and this video for other details about Kafka: https://www.youtube.com/watch?v=DkYNfb5-L9o&ab_channel=Devoxx

Kafka partitioner configs

Flexor answered 27/5, 2023 at 4:27 Comment(0)
