How to choose the number of partitions for a Kafka topic?

We have a 3-node ZooKeeper cluster and 7 brokers. Now we need to create a topic and decide how many partitions it should have.

But I did not find any formula for deciding how many partitions I should create for this topic. The producer rate is 5k messages/sec and each message is 130 bytes.

Thanks In Advance

Refreshment answered 10/5, 2018 at 11:16 Comment(5)
5k messages/sec from a single producer? Or overall from all threads on all possible producers (assuming more than one)? – Ultimo
@cricket_007 Thanks for your response. We have 5 producers which produce 5k messages/sec. – Refreshment
And what about your key distribution? Null keys? Some known value? – Ultimo
@cricket_007 Thanks for the response. We are not specifying any key for the messages, so yes, null keys. – Refreshment
So the 5 producers will just round-robin across however many partitions they can reach. If you run a network benchmark and get, say, 1 Gbps out of a producer network card, then at 5k msg/s × 130 B ≈ 650 KB/s (~5.2 Mbps) you have roughly 190× headroom before saturating that card. Keep going with that math if you want to optimize production throughput, keeping in mind that topics are more often consumed than produced, so you don't want to saturate the broker network interfaces just producing messages. – Ultimo

I can't give you a definitive answer; many patterns and constraints can affect it. But here are some of the things you might want to take into account:

  • The unit of parallelism is the partition, so if you know the average processing time per message, you can calculate the number of partitions required to keep up. For example, if each message takes 100 ms to process and you receive 5k per second, then you'll need at least 500 partitions (see the sketch after this list). Add a percentage more than that to cope with peaks and variable infrastructure performance. Queuing theory can give you the math to calculate your parallelism needs.

  • How bursty is your traffic, and what latency constraints do you have? Combining this with the previous point: if you also have latency requirements, you may need to scale out your partitions to cope with your peak traffic rate.

  • If you use any data-locality patterns or require message ordering, then you need to consider future traffic growth. For example, say you deal with customer data, use the customer id as the partition key, and depend on each customer always being routed to the same partition, perhaps for event sourcing or simply to ensure each change is applied in the right order. If you later add partitions to cope with a higher message rate, most customers will then be routed to a different partition. This introduces headaches for guaranteed message ordering, as a customer's data now exists on two partitions. So create enough partitions for future growth: consumers are easy to scale out and in, but partitions need planning, so err on the safe side and be future proof.

  • Having thousands of partitions can increase overall latency.
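
To make the first bullet concrete, here is a minimal sketch of that calculation in Python; the 30% headroom figure is an assumption for illustration, not a number from this answer:

    import math

    def partitions_to_keep_up(msgs_per_sec, processing_time_sec, headroom=0.30):
        """Minimum partitions so consumers can keep up, plus headroom for peaks."""
        # One consumer on one partition handles 1 / processing_time_sec msgs/sec,
        # so the base requirement is rate x per-message processing time.
        base = msgs_per_sec * processing_time_sec  # 5000 * 0.1 = 500
        return math.ceil(base * (1 + headroom))

    # 5k msgs/sec at 100 ms per message -> 500 partitions, 650 with 30% headroom
    print(partitions_to_keep_up(5000, 0.100))  # 650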

Swordtail answered 10/5, 2018 at 12:3 Comment(2)
Is it 50 partitions or 500? – Mandatory
Based on the math it should be 500. – Semiaquatic

This old benchmark by a Kafka co-founder is pretty useful for understanding the magnitudes of scale involved: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

The immediate conclusion from this, as Vanlightly said in the answer above, is that consumer processing time is the most important factor in deciding the number of partitions (since you are nowhere near challenging producer throughput).

The maximum concurrency for consuming is the number of partitions, so you want to make sure that:

(processing time for one message in seconds × number of messages per second) / number of partitions << 1

If it equals 1, you cannot read as fast as you write, and that is before accounting for bursts of messages and failures/downtime of consumers. So you need it to be significantly lower than 1; how much lower depends on the latency your system can endure.
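
As a rough sanity check, here is a minimal sketch of that inequality in Python, using the question's 5k msgs/sec and the 100 ms processing time from the answer above; the partition count is only an example:

    def consumer_utilization(processing_time_sec, msgs_per_sec, num_partitions):
        """Fraction of per-partition consumer capacity in use; keep it << 1."""
        return (processing_time_sec * msgs_per_sec) / num_partitions

    # 100 ms per message, 5k msgs/sec, 1250 partitions -> 0.4, well below 1
    print(consumer_utilization(0.100, 5000, 1250))  # 0.4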

Tody answered 13/10, 2019 at 20:36 Comment(0)

It depends on your required throughput, cluster size, and hardware specifications:

There is a clear blog post about this by Jun Rao from Confluent: How to choose the number of topics/partitions in a Kafka cluster?

This might also be helpful for insight: Apache Kafka Supports 200K Partitions Per Cluster

Kramer answered 17/12, 2018 at 15:50 Comment(0)

For example, if you want to be able to read 1000 MB/sec but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers and one producer can only write at 100 MB/sec, you need 10 partitions. In this case, with 20 partitions you can maintain 1 GB/sec for both producing and consuming. You should adjust the exact number of partitions to the number of consumers and producers, so that each consumer and producer achieves its target throughput.

So a simple formula could be:

#Partitions = max(NP, NC)

where:

NP is the number of required producers, determined by calculating TT/TP
NC is the number of required consumers, determined by calculating TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition

Source: https://docs.cloudera.com/runtime/7.2.10/kafka-performance-tuning/topics/kafka-tune-sizing-partition-number.html
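
A minimal sketch of that formula in Python, using the example numbers from this answer (TT = 1000 MB/s, TP = 100 MB/s, TC = 50 MB/s):

    import math

    def required_partitions(tt_mb_s, tp_mb_s, tc_mb_s):
        """#Partitions = max(NP, NC), where NP = TT/TP and NC = TT/TC."""
        np_required = math.ceil(tt_mb_s / tp_mb_s)  # producers needed
        nc_required = math.ceil(tt_mb_s / tc_mb_s)  # consumers needed
        return max(np_required, nc_required)

    # max(1000/100, 1000/50) = max(10, 20) = 20 partitions
    print(required_partitions(1000, 100, 50))  # 20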

Custodian answered 22/11, 2021 at 12:45 Comment(1)
It is all about the consumer, not the producer. The number of partitions should not affect producer scaling. – Biforked

Partitions = max(NP, NC)

where:

NP is the number of required producers, determined by calculating TT/TP
NC is the number of required consumers, determined by calculating TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition

Manchu answered 20/4, 2020 at 17:33 Comment(1)
It is all about the consumer, not the producer. The number of partitions should not affect producer scaling. – Biforked

You could choose the number of partitions equal to the maximum of {throughput / per-producer throughput; throughput / per-consumer throughput}. The throughput is calculated from message volume per second. Here you have: throughput = 5k × 130 bytes = 650 KB/s.
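
A quick check of that arithmetic in Python (plain unit conversion using only the numbers from the question):

    msgs_per_sec = 5_000
    msg_size_bytes = 130
    throughput = msgs_per_sec * msg_size_bytes  # 650,000 bytes/sec
    print(throughput / 1_000, "KB/s")           # 650.0 KB/s (~0.65 MB/s)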

Elspet answered 6/8, 2021 at 3:56 Comment(1)
It's 650 KB/s, not 650 MB/s. – Longrange
