Main question: we run Kafka Streams (Java) apps on Kubernetes to consume, process and produce real-time data in our Kafka cluster (running Confluent Community Edition v7.0 / Kafka v3.0). How can we deploy our apps in a way that limits downtime in record consumption? Our initial target was approx. 2 sec of downtime, occurring a single time, for each task.
We are aiming for continuous deployment of changes to the production environment, but deployments are too disruptive: they cause downtime in record consumption in our apps, which leads to latency in the produced real-time records.
We have tried different strategies to see how they affect latency (downtime).
Strategy #1:
- Terminate all app instances (total of 6)
- Immediately start all new app instances
- Result: measured max latency for consumed records: 85 sec
Strategy #2:
- Start one new app instance
- Wait 3 minutes to allow the new app instance to restore its local state (see the readiness-check sketch after Strategy #3)
- After 3 minutes, terminate one old app instance
- Repeat until all old app instances are terminated
- Result: measured max latency for consumed records: 39 sec
Strategy #3:
- Same as Strategy #2, but with the wait time increased to 15 minutes
- Result: measured max latency for consumed records: 7 sec
However, 15 minutes per app instance means 15 min x 6 instances = 90 minutes to deploy a change, plus an additional 30 minutes for the incremental rebalancing protocol to finish. We find that deploy time quite excessive.
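For context on the "wait to allow restoring the local state" step: instead of a fixed sleep, the wait could be gated on the new instance actually reaching RUNNING. Below is a minimal sketch using the standard KafkaStreams state listener (the class name and the readiness flag are ours; only setStateListener/State are Kafka Streams API, and RUNNING does not by itself guarantee that warmup replicas elsewhere have caught up):

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

public class StreamsReadiness {

    // Flipped to true once the instance leaves REBALANCING and reaches RUNNING.
    private static final AtomicBoolean READY = new AtomicBoolean(false);

    public static KafkaStreams startWithReadiness(Topology topology, Properties props) {
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.RUNNING) {
                READY.set(true);
            } else if (newState == KafkaStreams.State.ERROR) {
                READY.set(false);
            }
        });
        streams.start();
        return streams;
    }

    // Could back a Kubernetes readiness probe, so the rollout only proceeds
    // once the new pod reports RUNNING instead of waiting a fixed time.
    public static boolean isReady() {
        return READY.get();
    }
}
```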
We have been reading KIP-429 (Kafka Consumer Incremental Rebalance Protocol) and have tried to configure the app to support our use case.
These are the key Kafka Streams configurations we used for Strategies #2 and #3:
acceptable.recovery.lag: 6000
num.standby.replicas: 2
max.warmup.replicas: 6
probing.rebalance.interval.ms: 60000
num.stream.threads: 2
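For completeness, roughly how the settings above map onto StreamsConfig in Java (the application id and bootstrap servers are placeholders; everything not listed above is omitted):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {

    static Properties highAvailabilityProps() {
        Properties props = new Properties();
        // Placeholders; the real application.id and bootstrap servers differ.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "our-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        // The settings listed above.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 6000L);
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);
        props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 6);
        props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 60_000L);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
        return props;
    }
}
```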
The input topic has 12 partitions and an average message rate of 800 records/s. There are 3 Kafka Streams key-value state stores, two of which see the same rate as the input topic; these two are roughly 20 GB in size. The key set size is about 4000. In theory, the acceptable.recovery.lag above should correspond to roughly 60 sec of lag on the changelog topic per partition.
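Spelled out, the conversion behind that estimate (as we understand it) is:

    catch-up time ≈ acceptable.recovery.lag / (changelog records per second per partition)

i.e. the config is a record count, and how many seconds of lag it tolerates depends on the per-partition changelog rate.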
Here are some metrics (up, rate of messages received, and latency of messages received) per app instance for Strategy #3:
Notable observations we made:
1a - the first new app instance starts
1b - Kafka immediately rebalances; 2 old app instances get assigned more tasks and 2 old instances lose tasks
1c - at the same time, the max recorded latency increases from 0.2 sec to 3.5 sec (this indicates that the rebalance takes approx. 3 sec)
2 - a probing rebalance occurs, and Kafka Streams somehow decides to revoke a task from one of the old app instances and give it to the instance that already has the most tasks
3 - the last old app instance is terminated
4 - all partitions are rebalanced back to how they were prior to the upgrade; the incremental rebalance is completed (approx. 33 minutes after the last old app instance was terminated)
Other - it takes approx. 40 minutes for the first new app instance to get assigned a task. In addition, every task is re-assigned multiple times, causing many small disruptions of approx. 3 sec each.
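To get more visibility into what restoration is doing while tasks move around, here is a small sketch of the kind of logging we can attach via the standard global state restore listener (plain SLF4J logging assumed; how much visibility this gives into warmup standbys is something we would still need to verify):

```java
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RestoreLogger implements StateRestoreListener {

    private static final Logger LOG = LoggerFactory.getLogger(RestoreLogger.class);

    @Override
    public void onRestoreStart(TopicPartition partition, String storeName,
                               long startingOffset, long endingOffset) {
        LOG.info("Restore start: store={} partition={} offsets=[{}..{}] ({} records to restore)",
                 storeName, partition, startingOffset, endingOffset, endingOffset - startingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition partition, String storeName,
                                long batchEndOffset, long numRestored) {
        LOG.debug("Restore progress: store={} partition={} batchEndOffset={} restoredInBatch={}",
                  storeName, partition, batchEndOffset, numRestored);
    }

    @Override
    public void onRestoreEnd(TopicPartition partition, String storeName, long totalRestored) {
        LOG.info("Restore done: store={} partition={} totalRestored={}",
                 storeName, partition, totalRestored);
    }
}
```

Attached with streams.setGlobalStateRestoreListener(new RestoreLogger()) before streams.start().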
We can provide many more details about the topology, topics, configuration and metrics diagrams for the other strategies if needed (this thread is already huge).