Scaling Kafka for Microservices

Problem

I want to understand how to design message passing between microservices so that they remain elastic and scalable.

Goal

  • Microservices allow an elastic number of instances, scaled up and down automatically based on the current load. Kafka should not limit this characteristic.
  • We need to guarantee at-least-once delivery.
  • We need to guarantee the delivery order of events concerning the same entity.

Running example

  • For simplicity, let's say there are two microservices, A and B.
  • A1 and A2 are instances of microservice A; B1 and B2 are instances of microservice B.
  • A1 and A2 publish events describing CRUD operations on entities of A, such as entity created, updated, or deleted.

Design I

  • A1 publishes events under topic a.entity.created, including the id of the created entity in the message body.
  • The number of partitions (p) has to be specified for the topic in order to allow parallel consumption. This is what allows for scalability.
  • B1 and B2 subscribe to the topic a.entity.created as a consumer group b-consumers. This leads to a load distribution between instances of B (a minimal consumer sketch follows below).
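For illustration, here is a minimal sketch of what one b-consumers instance could look like with the plain Java client. The broker address is an assumption; committing offsets only after processing is what gives at-least-once delivery (records are re-delivered if the consumer dies mid-batch).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "b-consumers");             // same group id on B1 and B2
        props.put("enable.auto.commit", "false");         // commit manually, after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("a.entity.created"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record.key(), record.value()); // process first ...
                }
                consumer.commitSync(); // ... then commit: at-least-once semantics
            }
        }
    }

    private static void handle(String key, String value) { /* hypothetical business logic */ }
}
```

Running the same program twice (as B1 and B2) with the same group.id makes Kafka split the topic's partitions between the two instances automatically.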

Questions:

  • a) Events concerning the same entity might be processed in parallel and might get out of order, right?
  • b) This will lead to instances of B dealing with requests in parallel. The limiting factor is how p (the number of partitions) is defined by the producers. If there are 5 partitions but I'd need 8 consumers to cope with the load, it won't work, since 3 consumers will not receive events. Did I understand this correctly? This would IMO be unusable for elastic microservices which might want to scale further.

Design II

  • A1 publishes events under topic a.entity.created.{entityId}, leading to a lot of different topics.
  • Each topic's partition count is set to 1.
  • B1 and B2 subscribe to the topics a.entity.created.* with a wildcard as a consumer group b-consumers. This leads to a load distribution between instances of B (see the pattern-subscription snippet below).
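Wildcard subscription exists in the Java client as a regex-based subscribe. A snippet, assuming the same consumer setup as in the sketch under Design I (the regex itself is just a guess at how the per-entity topic names would be matched):

```java
import java.util.regex.Pattern;

// Replaces the subscribe(...) call in the Design I sketch. The pattern is
// re-evaluated against newly created topics on the client's metadata refresh,
// so topics for new entities are picked up automatically (with some delay).
consumer.subscribe(Pattern.compile("a\\.entity\\.created\\..*"));
```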

Questions:

  • c) Events concerning the same entity should be guaranteed to be delivered in order since there is only one partition, right?
  • d) How is scalability handled here? The number of partitions shouldn't limit the number of consumers, or does it?
  • e) Is there a better way to guarantee the goals mentioned above?

Design III (thx to StuartLC)

  • A1 publishes events under topic a.entity.created with a partition key based on the entityId.
  • The topic's partition count is set to 10.
  • B1 and B2 subscribe to the topic a.entity.created as a consumer group b-consumers. This leads to a load distribution between instances of B. Events regarding the same entity will be delivered in order thanks to the partition key (see the producer sketch below).
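A sketch of the producer side of this design with the plain Java client; the broker address, entity id, and payload format are made-up placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String entityId = "entity-42"; // hypothetical entity id
            String payload = "{\"id\":\"entity-42\",\"op\":\"created\"}";
            // The record key doubles as the partition key: every event for
            // entity-42 hashes to the same partition, which preserves order.
            producer.send(new ProducerRecord<>("a.entity.created", entityId, payload));
        }
    }
}
```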

Questions:

  • f) With p=10 I can have at most 10 consumers. This means I have to estimate the number of consumers at design time / deploy time, e.g. via environment variables. Can I somehow move that to runtime so that it is adjusted dynamically?
Prestige answered 18/5, 2021 at 8:35

Answer

You don't need a separate topic per entity to ensure delivery order of messages.

Instead, assign a partition key on each message based on your entity id (or other immutable identifier for the entity) to ensure that events concerning the same entity are always delivered to the same partition on each topic.

Chronological ordering of messages on the partition will generally be preserved even if more than one producer instance publishes messages for the same entity (e.g. if A1 and A2 both publish messages for entity E1, the partition key for E1 will ensure that all E1 messages get delivered to the same partition, P1). There are some edge cases where ordering will not be preserved (e.g. loss of connectivity of a producer followed by retries), in which case you might look at enabling idempotence.
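Enabling idempotence is a producer-side configuration change. A sketch of the relevant settings, added to the producer Properties from the Design III sketch above:

```java
// Idempotent producer: the broker de-duplicates internal retries, so a
// flaky connection no longer produces duplicate or reordered writes.
props.put("enable.idempotence", "true");
props.put("acks", "all"); // required (and implied) by idempotence
props.put("max.in.flight.requests.per.connection", "5"); // must stay <= 5
```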

You are right: at any one time, at most one consumer from a consumer group will be assigned to a single topic partition, although a consumer can be assigned more than one partition. So e.g. consumer B2 might process all messages from partition P1 (in sequential order), as well as from another partition. If a consumer dies, another consumer will be assigned its partitions (the rebalance can take several seconds).
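If you want to observe those assignments, or commit cleanly before a partition is taken away, the Java client lets you attach a rebalance listener. A sketch that would replace the subscribe(...) call in the earlier consumer sketch:

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Collections.singletonList("a.entity.created"),
    new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Commit fully processed work before ownership moves away,
            // which narrows the window for duplicate redelivery.
            consumer.commitSync();
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            System.out.println("Now owning partitions: " + partitions);
        }
    });
```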

When a partition key is provided on messages, by default the message-to-partition assignment is done based on a hash of the partition key. In rare scenarios (e.g. very few distinct partition keys), this could result in an uneven distribution of messages across partitions, in which case you could provide your own partitioning strategy, if you find that throughput is affected by the imbalance.
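A custom partitioning strategy plugs in via the producer's partitioner.class setting. A minimal sketch of the interface involved; the class name is hypothetical, and the body just mirrors the default key-hashing behaviour, which is the part you would replace:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class EntityPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        // Assumes a non-null key. Swap this hash for your own scheme if
        // your entity ids distribute unevenly across partitions.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

It is registered on the producer via props.put("partitioner.class", EntityPartitioner.class.getName()).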

Based on the above, after configuring an appropriate number of partitions for your scenario, you should be able to meet your design goals using consistent partition keys without needing to do too much customisation to Kafka at all.

Redress answered 18/5, 2021 at 10:37

Comments:
Thanks! I totally missed the feature of partition keys in my strategy. Part of question b) is still not quite clear to me: the number of possible consumers is still determined by the number of partitions, right? So I have to know beforehand how many instances of the consuming microservice B I will probably need at most. Is there a strategy to avoid that? — Prestige
In general, you'll want to have the same number of consumer processes (with the same group id) as there are partitions on the topic (more consumers means there are redundant consumers doing nothing, and fewer consumers means some consumers get more than one partition assigned). However, the number of consumers and partitions you need will depend on the actual throughput you require (and your hardware budget). Depending on what your consumers are doing, you might just find there's another downstream bottleneck (e.g. database or API contention); Kafka isn't usually the limiting factor in an enterprise. — Redress
@StuartLC, I have seen id-based partitioning cause an uneven distribution of messages. What's worse, this caused a sort of lock-in where the devs scaled their consumers vertically rather than horizontally, wasting resources (and it wasn't as effective either), finally leading to a complete architecture overhaul. Would it be better to understand what the id signifies before using it as a key? Or is there another way? — Jumper
As per above, hashing of the partition key should distribute 'entities' across partitions fairly uniformly, provided that there is a large number of entities and a roughly equal number of messages per entity. If this doesn't work for you, come up with your own partitioning scheme (i.e. explicitly assign a message to a partition; see the snippet below). Obviously, if you don't need to guarantee ordered delivery, leaving the partition key as null will revert Kafka to its default round-robin distribution, which will result in equal distribution. — Redress
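Explicitly assigning a message to a partition, as suggested here, is just a different ProducerRecord overload; no custom partitioner class is needed. A snippet reusing the variables from the Design III producer sketch (partition 3 is an arbitrary example):

```java
// The explicit partition number bypasses the configured partitioner entirely.
producer.send(new ProducerRecord<>("a.entity.created", 3, entityId, payload));
```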
Okay, probably I am being too unspecific here. Let's say I have an application with very unusual load spikes, e.g. one hour a day. Do I then always have to estimate the highest load and configure partitions accordingly? Isn't there a way to do that dynamically, or in other words, to add/remove partitions depending on the load? Otherwise you have a lot of resources just sitting around doing nothing, I'd guess. — Prestige
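For what it's worth, Kafka's Admin API can grow (never shrink) a topic's partition count at runtime; a sketch follows, with an assumed broker address. The big caveat is that existing records stay where they are while new keys hash across the new count, so per-entity ordering is only guaranteed for events published after the change, which is why partitions are usually over-provisioned up front instead:

```java
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitions;

public class GrowTopic {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
            // Raise a.entity.created from its current count to 16 partitions.
            admin.createPartitions(
                    Map.of("a.entity.created", NewPartitions.increaseTo(16)))
                 .all().get(); // block until the brokers apply the change
        }
    }
}
```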
