Kafka Streams: deleting consumed repartition records to reduce disk usage

We have a Kafka instance with about 50M records and about 100k new records per day, so nothing crazy in Kafka terms. When we want to reprocess these records with one of our more complex Streams apps (with many different aggregation steps), the disk usage from the repartition topics gets very high. From what we have understood, these topics use the standard retention time (14 days?) in kafka-streams 1.0.1 and Long.MAX_VALUE in 2.1.1. This is very inconvenient, since in our case each record in a repartition topic is read only once, when the aggregation is done, and can be deleted after that.

So our question is whether there is any way to configure kafka-streams so that it purges records after they have been processed. I have seen that there is some way to do this with purgeDataBefore() (https://issues.apache.org/jira/browse/KAFKA-4586).
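To make the question concrete, here is a rough, untested sketch of what I imagine doing by hand. As far as I can tell, the purge API from that ticket is exposed in the Java client as AdminClient.deleteRecords(), and the Streams application.id doubles as the consumer group id. The broker address and the application id below are placeholders, and filtering on the "-repartition" suffix is my assumption about how the internal topics are named.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class RepartitionPurger {

    // For every repartition partition with a committed offset, request deletion
    // of all records before that offset. Other topics are left untouched.
    static Map<TopicPartition, RecordsToDelete> purgeRequest(
            Map<TopicPartition, OffsetAndMetadata> committed) {
        Map<TopicPartition, RecordsToDelete> request = new HashMap<>();
        for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
            // Assumption: internal repartition topics end with "-repartition".
            if (e.getValue() != null && e.getKey().topic().endsWith("-repartition")) {
                request.put(e.getKey(), RecordsToDelete.beforeOffset(e.getValue().offset()));
            }
        }
        return request;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Placeholder: the application.id of the Streams app (used as the group id).
        String applicationId = "my-streams-app";

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the Streams app has already committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(applicationId)
                         .partitionsToOffsetAndMetadata()
                         .get();

            Map<TopicPartition, RecordsToDelete> request = purgeRequest(committed);
            if (!request.isEmpty()) {
                // Blocks until the brokers have acknowledged the deletions.
                admin.deleteRecords(request).all().get();
            }
        }
    }
}
```

Running something like this on a schedule next to the app feels clumsy, though, which is why I am hoping there is a built-in setting instead.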

For reference, some sizes in a part of the app:

table-1 (changelog, compact ~ 2GB) --> change key and aggregate (repartition ~ 14GB) --> table-2 (changelog, delete, 14KB) --> change key and aggregate (repartition 21GB) --> table-3 (changelog, compact, 0.5GB)

(This is my first stack overflow question so any feedback is appreciated, thanks in advance!)
