Kafka Streams deleting consumed repartition records, to reduce disk usage

We have a Kafka instance with about 50M records and about 100k new records per day, so nothing crazy in the Kafka world. When we want to reprocess these records with one of our more complex Streams apps (with many different aggregation steps), the disk usage from the repartition topics gets very high. From what we have understood, these topics use the standard retention time (14 days?) in kafka-streams 1.0.1 and Long.MAX_VALUE in 2.1.1. This is very inconvenient because, in our case, each record in a repartition topic is read only once, when the aggregation is done, and can be deleted after that.

So our question is whether there is any way to configure kafka-streams to purge records after they have been processed. I have seen that there is some way to do this with purgeDataBefore() (https://issues.apache.org/jira/browse/KAFKA-4586).

For reference, some sizes in a part of the app:

table-1 (changelog, compact ~ 2GB) --> change key and aggregate (repartition ~ 14GB) --> table-2 (changelog, delete, 14KB) --> change key and aggregate (repartition 21GB) --> table-3 (changelog, compact, 0.5GB)

(This is my first stack overflow question so any feedback is appreciated, thanks in advance!)

Culpepper asked 15/3, 2019 at 12:20

Kafka Streams has used the purgeDataBefore() API (released in the client as AdminClient#deleteRecords()) since the 1.1 release: https://issues.apache.org/jira/browse/KAFKA-6150

You don't need to enable it (and you cannot disable it either).

Kinsman answered 17/3, 2019 at 17:38 Comment(4)
Do you know, then, how the repartition topics can grow very large even when the aggregation keeps up? - Culpepper
I realized we downgraded to 1.0.1 because we had issues with rolling logs in 2.1.1. I might ask a question about this later. Thank you very much for the answer! - Culpepper
Data purging only works for older segments, not for the active segment. The default segment size is 1GB; since 1.1, Kafka Streams creates repartition topics with a segment size of 50MB to make purging more effective. Maybe you need to manually reconfigure the repartition topics. - Kinsman
Thanks again! That is very helpful. We will try upgrading to 2.1.1 again on our test server and see if we can make it work with some higher repartition segment sizes, even though it did seem to create segments much smaller than 50MB. - Culpepper
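
As a rough illustration of the segment-size point in the comments above, here is a minimal sketch of passing a segment size for internal topics through the Streams properties. It uses plain string keys so that it runs without the Kafka jars; "topic.segment.bytes" is what StreamsConfig.topicPrefix(TopicConfig.SEGMENT_BYTES_CONFIG) resolves to. The class name, application id, broker address, partition count and the 10 MB value are made up for the example. Two assumptions to check against your version: the "topic." prefix applies to all internal topics (changelogs as well as repartition topics), and it only takes effect when Streams creates the topic, so topics that already exist would have to be altered on the broker side.

```java
import java.util.Properties;

public class RepartitionTopicSettings {

    // Builds Streams properties with a segment-size override for internal topics.
    // The "topic." prefix is the one StreamsConfig.topicPrefix(...) produces.
    public static Properties build(String applicationId, String bootstrapServers, long segmentBytes) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("topic.segment.bytes", Long.toString(segmentBytes));
        return props;
    }

    // Purging cannot remove the active segment, so each partition can hold up to
    // one segment of already-consumed data. This is that upper bound per replica.
    public static long maxActiveSegmentBytes(int partitions, long segmentBytes) {
        return partitions * segmentBytes;
    }

    public static void main(String[] args) {
        long segmentBytes = 10L * 1024 * 1024; // 10 MB, illustrative value only
        Properties props = build("reprocessing-app", "localhost:9092", segmentBytes);
        System.out.println("topic.segment.bytes = " + props.getProperty("topic.segment.bytes"));
        System.out.println("max unpurgeable active data for 8 partitions = "
                + maxActiveSegmentBytes(8, segmentBytes) + " bytes");
    }
}
```

In a real app these properties would be passed to the KafkaStreams constructor. Smaller segments let purging free disk space sooner, at the cost of more segment files and more frequent log rolls on the brokers.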
