Kafka producer - How to change a topic without down-time while preserving message ordering?
This question is about architecture and Kafka topic migration.

Original problem: schema evolution without backward compatibility.

https://docs.confluent.io/current/schema-registry/avro.html

I am asking the community for advice, or for articles from which I can get inspired and maybe think of a solution to my problem. Maybe there is an architecture or streaming pattern. It is not necessary to give me a language-specific solution; just give me a direction in which I can go... My question is broad; it may be interesting for anyone who later wants to:

  • a) change the message format and produce messages into a new topic.
  • b) stop producing messages into one topic and start producing messages into another topic "instantly"; in other words, once a message in v2 has been produced, no new messages are appended to v1.

Problem

I am changing the message format, and the new format is not compatible with the previous version. In order not to break existing consumers, I decided to produce messages to a new topic.

Up-caster idea

I have read about an up-caster.

https://docs.axoniq.io/reference-guide/operations-guide/production-considerations/versioning-events

Formal task

Let v1 and v2 be the topics. Currently, I produce messages in the format format_v1 into the topic v1. I want to produce messages in the format format_v2 into the topic v2. The switch should happen at some moment of time which I can choose.

In other words, at some moment in time, all instances of the producer stop sending messages into v1 and start sending messages into v2; thus the last message m1 in v1 is produced before the first message m2 in v2.

Details

I got an idea: I can keep producing messages to the topic v1 and have a Kafka Streams up-caster that is subscribed to v1 and pushes transformed messages to v2. Let us assume that the transformer (in my case, of course) is able to transform messages of format_v1 into format_v2 without errors.
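The up-caster idea can be sketched as follows. This is a minimal in-memory simulation (topics are plain lists, and `upcast` is a hypothetical format_v1 → format_v2 transform), not a real Kafka Streams topology:

```python
# Sketch of the up-caster: consume records from topic v1, transform each one
# from format_v1 to format_v2, and produce the result into topic v2.
# Kafka is simulated with in-memory lists; upcast() is a hypothetical transform.

def upcast(record_v1):
    # Hypothetical format_v1 -> format_v2 transformation: here we just
    # bump a schema-version marker on a copy of the record.
    record_v2 = dict(record_v1)
    record_v2["schema"] = 2
    return record_v2

def run_upcaster(topic_v1, topic_v2, last_offset):
    """Process all v1 records after last_offset, appending the upcasted
    records to v2; return the new last processed v1 offset."""
    for offset in range(last_offset + 1, len(topic_v1)):
        topic_v2.append(upcast(topic_v1[offset]))
    return len(topic_v1) - 1

topic_v1 = [{"schema": 1, "payload": "a"}, {"schema": 1, "payload": "b"}]
topic_v2 = []
last = run_upcaster(topic_v1, topic_v2, -1)
```

In a real deployment this loop would be a Kafka Streams application (or a consume-transform-produce loop) consuming v1 and producing v2, with the processed offset tracked by the consumer group rather than a return value.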

As described in the link above about Avro schema evolution, by the time I have added the up-caster and am still producing messages into v1, all my consumers have been switched from v1 to v2.

Now, a tricky part. We have two requirements:

1. No production down-time.

2. Preserve message ordering.

It means:

1) We are not allowed to lose messages; a client may use our system at any time, so our system must be able to produce a message at any time.

2) We are running multiple instances of the producer. At some moment in time there can (potentially) be instances that produce messages of format format_v1 into the topic v1, and other instances that produce messages of format format_v2 into the topic v2.

As we know, Kafka does not guarantee message ordering across different partitions and topics.

I can solve the problem with partitions by writing messages into v2 with the same partition selector as for v1. Or, for now, we can imagine that we use just one partition for v1 and one partition for v2.
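Using the same partition selector for both topics can be sketched like this. The hash function below is a simplified stand-in for a Kafka partitioner (not Kafka's actual murmur2-based default), and it assumes v1 and v2 have the same partition count:

```python
# Sketch: pick the partition for v2 with the same key-based partitioner
# used for v1, so per-key ordering carries over across the topic switch.
# hashlib.md5 is a deterministic stand-in for Kafka's default partitioner.

import hashlib

def select_partition(key: str, num_partitions: int) -> int:
    # Deterministic hash of the message key -> partition index.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

NUM_PARTITIONS = 4  # assumed equal for v1 and v2

# The same key lands on the same partition index in both topics:
p_v1 = select_partition("customer-42", NUM_PARTITIONS)
p_v2 = select_partition("customer-42", NUM_PARTITIONS)
```

The key point is only that the mapping from key to partition is identical and deterministic for both topics; which hash you use matters less than using the same one (and the same partition count) on both sides.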


My simplifications and attempts

1) I imagined that by the moment I want to switch the producer to the new topic, I have an up-caster (a Kafka Streams component) that is capable of transforming messages from v1 into v2 without errors. This Kafka Streams component is scalable.

2) All my consumers have already been switched to the v2 topic. They constantly receive messages from v2. At this moment, my producer instances are producing messages into the topic v1 and the up-caster does its job well.

3) To simplify the problem, let's imagine that for now format_v1 and format_v2 do not matter, and they are the same.

4) Let's imagine we have one partition for v1 and one partition for v2.

Now my problem: how do I instantly switch all producers so that, from a given point in time, all instances produce messages into the topic v2?

My colleague, a Kafka expert, told me that it can be done with down-time:

If you rely on the order of the messages in the partitions, you cannot switch to the new version without down-time. To keep the down-time minimal, we can do the following.

The up-caster component must write the data to the same partitions and should try to produce the same offsets. However, this is not always possible, as offsets may have gaps, so a mapping between old offsets and new offsets must be kept. Not for all the records; only for the last bulk of each partition. If the up-caster crashes, just start it again; the producer is still not involved with v2.

Start the v2 consumer. If it starts with the same consumer group as v1, nothing needs to be done; if it has a new consumer group, update the offsets in Kafka according to the new offsets.
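The offset mapping and the consumer-group migration it enables can be sketched like this (an in-memory simulation; in real Kafka the committed offset would be updated with the AdminClient or `kafka-consumer-groups.sh`, and the transform would be the real up-caster, not the identity):

```python
# Sketch of the offset mapping: while copying records from v1 to v2, the
# up-caster records, per partition, which v2 offset each v1 offset ended
# up at, so a new consumer group on v2 can be seeded from the group's
# committed v1 offset. Topics are simulated with in-memory lists.

def upcast_with_mapping(topic_v1, topic_v2):
    offset_map = {}  # v1 offset -> v2 offset (kept only for the last bulk)
    for v1_offset, record in enumerate(topic_v1):
        topic_v2.append(record)  # identity transform, for simplicity
        offset_map[v1_offset] = len(topic_v2) - 1
    return offset_map

def migrate_committed_offset(committed_v1, offset_map):
    """Translate a consumer group's committed v1 offset (the offset of the
    next record to read) into the corresponding v2 offset."""
    if committed_v1 == 0:
        return 0
    return offset_map[committed_v1 - 1] + 1

topic_v1 = ["m0", "m1", "m2"]
topic_v2 = []
offset_map = upcast_with_mapping(topic_v1, topic_v2)
# The group had consumed m0 and m1, so its committed v1 offset is 2:
v2_offset = migrate_committed_offset(2, offset_map)
```

Here the mapping is trivially the identity because v2 starts empty; the mapping earns its keep exactly when v2 offsets diverge from v1 offsets (transactional markers, compaction gaps, or an up-caster restart), which is the case the colleague warns about.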

Now the producers write to v1, the up-caster converts the data, and the consumers consume from v2.

Here comes the down-time. When the lag of the up-caster is close to 0, shut down the v1 producers, wait until the up-caster converts the rest of the records, shut down the up-caster, and start the v2 producers, which write to the v2 topic.

I thought of manual manipulation in the database (via some REST endpoint, etc.) to change a flag; producers always check the flag before they produce messages. When the flag says v2 (or true), the producer will start writing messages into v2. However, what if, while the flag is still false, one producer starts producing a message into v1, then the flag changes and another producer sends a message into v2 before the first producer has finished producing into v1?
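The race described above can be made concrete with a small simulation. The interleaving is staged by hand (one producer reads the flag, the flag flips, another producer writes first), which is exactly what can happen in production with concurrent producers and a shared flag:

```python
# Sketch of the flag-based switch and its race: producer A reads the flag
# while it is still False, the flag flips, producer B reads True and writes
# to v2 *before* A completes its in-flight write to v1. Result: a v1
# message lands after the first v2 message, violating the requirement.

flag = {"use_v2": False}
topic_v1, topic_v2 = [], []
event_log = []  # global append order across both topics

def produce(name, message):
    use_v2 = flag["use_v2"]  # the flag is read first...
    # ...but the actual send may happen arbitrarily later (GC pause,
    # network delay, slow request), using the stale flag value.
    (topic_v2 if use_v2 else topic_v1).append(message)
    event_log.append((name, "v2" if use_v2 else "v1"))

# Producer A reads the flag while it is still False:
a_sees_v2 = flag["use_v2"]
flag["use_v2"] = True        # operator flips the flag
produce("B", "m-new")        # B sees True and writes to v2 first...
# ...then A's delayed write completes with the stale flag value:
(topic_v2 if a_sees_v2 else topic_v1).append("m-old")
event_log.append(("A", "v2" if a_sees_v2 else "v1"))
```

The log ends with a v1 write after the first v2 write, which is precisely the ordering violation the question is worried about; a plain flag check without coordination cannot rule it out.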

Hazelwood answered 28/2, 2020 at 17:36 Comment(0)
Is it acceptable for you to have only one active producer?

In that case you can use your idea with a flag:

  1. Shut down all producers p2,p3,...,pn except p1
  2. p1 writes to v1 alone
  3. Switch the flag to v2, so p1 ends its last write to v1 and starts writing to v2
  4. Now nobody writes to v1
  5. Start your other producers p2,p3,...,pn
  6. Because of the flag, every producer now writes to v2, and still nobody writes to v1
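The steps above can be sketched as a small in-memory simulation (topics are lists, and producer shutdown/startup is modeled by a set of active producer names):

```python
# In-memory sketch of the answer's single-active-producer switch: with
# only p1 running, the flag flip is race-free, so the last v1 write
# strictly precedes the first v2 write.

topic_v1, topic_v2 = [], []
flag = {"use_v2": False}

def produce(message):
    (topic_v2 if flag["use_v2"] else topic_v1).append(message)

# 1-2. Shut down all producers except p1; p1 writes to v1 alone.
active_producers = {"p1"}
produce("last-v1")

# 3. Switch the flag; p1's next write goes to v2.
flag["use_v2"] = True
produce("first-v2")

# 4-6. Nobody writes to v1 any more; restart the other producers,
#      which all read the flag and therefore write to v2.
active_producers = {"p1", "p2", "p3"}
produce("from-p2")
```

The race from the question disappears because between steps 2 and 3 there is exactly one producer, so no second instance can read a stale flag value concurrently.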
Laity answered 11/3, 2020 at 9:39 Comment(3)
I agree, it is possible, thanks. I will do more research and see whether there is other advice.Hazelwood
There is one drawback, though. Look, I do not mean to criticize your answer; I just want everybody who reads this to be able to learn. The good point of the solution above is that it is doable and not complex (it can be done, and it is clear how!). On the other hand, imagine you have multiple environments: develop, staging, test, pre-prod, prod. That means there are jobs to be done for each environment: stop all producers except one, switch the feature flag, start the producers again. Just think about how to automate all these jobs (or reduce the amount of manual work). If you can do that, it will be amazing!Hazelwood
Yes, you're right. I wouldn't recommend doing these steps manually. Automate everything you plan to do multiple times, to save time and prevent errors. In my naive imagination, such an automation script would be very easy to implement because each step is simple.Laity

© 2022 - 2024 — McMap. All rights reserved.