Purpose of the state store and changelog topic in Kafka Streams?

I have a Kafka Streams application that uses a state store (backed by RocksDB).

All the stream thread does is read data from a Kafka topic and put it into the state store. (Another thread reads data from the state store and does the business-logic processing.)

I observed that it creates a new Kafka topic, "changelog", because of the state store.

But I don't understand what purpose the "changelog" Kafka topic serves.

  • Why is the changelog topic needed?
  • What is the relationship between the state store and the "changelog" Kafka topic?
  • Who puts data into the "changelog" topic?
Jesseniajessey answered 1/5, 2020 at 21:52 Comment(1)
Basically, the changelog is used by Kafka Streams. In the event of a failure, the state of your application can be recreated from the changelog. That's why the state store writes to the changelog. This page makes it pretty clear. - Roadblock
A
2

When change logging is enabled for a state store (it is by default), Kafka Streams captures changes to the state and writes them to a changelog topic in Kafka. This changelog topic acts as durable, fault-tolerant storage for the state, allowing the state to be restored after application restarts or failures.

Let's take a word count example.

Initial State:

  • Word: "hello", Count: 1
  • Word: "world", Count: 1

Change Log Entries:

When a word is processed multiple times, the state store updates the count for that word, and these updates are written to the changelog topic.

  • Update for "hello":
    • Word: "hello", Count: 2
  • Update for "world":
    • Word: "world", Count: 2
  • Update for "hello" again:
    • Word: "hello", Count: 3

Changelog Topic:

The changelog topic for the word-count-store might contain records like the following:

  • Key: "hello", Value: 1 (Initial state)
  • Key: "world", Value: 1 (Initial state)
  • Key: "hello", Value: 2 (Update)
  • Key: "world", Value: 2 (Update)
  • Key: "hello", Value: 3 (Update)

Restore State:

If the Kafka Streams application restarts or fails over to another instance, it can restore the state of the word-count-store by replaying the changelog topic from the beginning. This ensures that the state is consistent and up-to-date across application instances.
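The write-then-replay cycle described above can be sketched in plain Java. This is only a toy in-memory model, not the Kafka Streams API: the class `ChangelogRestoreSketch`, the `Change` record type, and the methods `countWord` and `restore` are all hypothetical names made up for illustration, and a `List` stands in for one changelog topic partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of a state store with change logging.
// Assumption: none of these names exist in Kafka Streams; they only illustrate the idea.
class ChangelogRestoreSketch {

    // One changelog record: a key and the value it had right after an update.
    static final class Change {
        final String key;
        final long value;

        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Counts one word: updates the local store, then appends the new value to the changelog.
    static void countWord(Map<String, Long> store, List<Change> changelog, String word) {
        long updated = store.merge(word, 1L, Long::sum);
        changelog.add(new Change(word, updated));
    }

    // Rebuilds a store by replaying the changelog from the beginning.
    // Later records for a key overwrite earlier ones.
    static Map<String, Long> restore(List<Change> changelog) {
        Map<String, Long> restored = new HashMap<>();
        for (Change change : changelog) {
            restored.put(change.key, change.value);
        }
        return restored;
    }

    public static void main(String[] args) {
        Map<String, Long> store = new HashMap<>();
        List<Change> changelog = new ArrayList<>();
        for (String word : new String[] {"hello", "world", "hello", "world", "hello"}) {
            countWord(store, changelog, word);
        }

        // Simulate losing the local store and rebuilding it from the changelog.
        Map<String, Long> restored = restore(changelog);
        System.out.println(restored.equals(store)); // prints true
    }
}
```

The changelog here ends up with the same five records listed in the example (hello=1, world=1, hello=2, world=2, hello=3), and replaying them yields hello=3, world=2.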

Compact Topics:

To optimize storage and reduce the volume of changelog data, the changelog topic uses log compaction (Kafka Streams creates changelog topics for key-value stores with cleanup.policy=compact by default). Compaction ensures that only the latest update for each key is eventually retained in the changelog topic, allowing the state to be fully restored while minimizing storage requirements.
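To see why compaction does not hurt restoration, here is a rough simulation in plain Java. It is a simplification under stated assumptions: the class `CompactionSketch` and its methods are hypothetical, the whole log is compacted at once (a real broker compacts in the background and leaves the newest segment untouched), and delete markers (records with a null value) are ignored.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of log compaction. Assumption: these names are illustrative, not a Kafka API.
class CompactionSketch {

    // Keeps only the most recent record per key, ordered by where that record sat in the log.
    static List<Map.Entry<String, Long>> compact(List<Map.Entry<String, Long>> log) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (Map.Entry<String, Long> rec : log) {
            latest.remove(rec.getKey()); // re-inserting moves the key to the end of the order
            latest.put(rec.getKey(), rec.getValue());
        }
        List<Map.Entry<String, Long>> compacted = new ArrayList<>();
        for (Map.Entry<String, Long> e : latest.entrySet()) {
            compacted.add(new SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        return compacted;
    }

    // Replays a log into a fresh store; later records overwrite earlier ones.
    static Map<String, Long> restore(List<Map.Entry<String, Long>> log) {
        Map<String, Long> store = new HashMap<>();
        for (Map.Entry<String, Long> rec : log) {
            store.put(rec.getKey(), rec.getValue());
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> log = new ArrayList<>();
        log.add(new SimpleImmutableEntry<>("hello", 1L));
        log.add(new SimpleImmutableEntry<>("world", 1L));
        log.add(new SimpleImmutableEntry<>("hello", 2L));
        log.add(new SimpleImmutableEntry<>("world", 2L));
        log.add(new SimpleImmutableEntry<>("hello", 3L));

        List<Map.Entry<String, Long>> compacted = compact(log);
        System.out.println(compacted.size()); // prints 2
        // Restoring from the compacted log gives the same state as restoring from the full log.
        System.out.println(restore(compacted).equals(restore(log))); // prints true
    }
}
```

Five records shrink to two (world=2, hello=3), yet the restored state is identical, which is the property that makes compaction safe for changelog topics.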

Altonaltona answered 10/1 at 14:30 Comment(0)
S
12

The short answer to this question: to achieve fault tolerance.

Details:

The changelog makes the state store in your Kafka Streams application fault tolerant. As your application writes more data into the state store, that data is also pushed to the changelog topic, so that if the node running the application goes down, the changelog topic can be used to reload the state store with the latest state.

Each application thread or instance (more precisely, each stream task) gets its own changelog topic partition, so that every instance can recreate its state when the application is restarted after a failure.

The data is pushed to the topic automatically by Kafka Streams whenever the state store is updated.
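As a rough illustration of the per-partition layout mentioned above, here is a plain-Java sketch. The topic name follows the `<application.id>-<store name>-changelog` convention Kafka Streams uses for these internal topics. Everything else is an assumption for illustration: the class `PartitionedChangelogSketch`, the application id `word-count-app`, and the nested lists that stand in for changelog partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: one changelog partition per task. Not the Kafka Streams API.
class PartitionedChangelogSketch {

    // Internal changelog topics are named <application.id>-<store name>-changelog.
    static String changelogTopicName(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // Restores the store of a single task by replaying only its own changelog partition.
    // Each record is modelled as a String[] {key, value}.
    static Map<String, String> restorePartition(List<List<String[]>> changelog, int partition) {
        Map<String, String> store = new HashMap<>();
        for (String[] rec : changelog.get(partition)) {
            store.put(rec[0], rec[1]);
        }
        return store;
    }

    public static void main(String[] args) {
        // "word-count-app" is a hypothetical application.id.
        System.out.println(changelogTopicName("word-count-app", "word-count-store"));
        // prints word-count-app-word-count-store-changelog

        List<List<String[]>> changelog = new ArrayList<>();
        changelog.add(new ArrayList<>()); // partition 0, written by task 0
        changelog.add(new ArrayList<>()); // partition 1, written by task 1
        changelog.get(0).add(new String[] {"hello", "1"});
        changelog.get(1).add(new String[] {"world", "1"});
        changelog.get(0).add(new String[] {"hello", "2"});

        // If the instance running task 0 fails, only partition 0 has to be replayed.
        Map<String, String> task0Store = restorePartition(changelog, 0);
        System.out.println(task0Store.get("hello"));         // prints 2
        System.out.println(task0Store.containsKey("world")); // prints false
    }
}
```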

I would suggest going through Chapter 11 of Kafka: The Definitive Guide. It contains a pretty good explanation of the Kafka Streams architecture and the stream processing patterns.

Hope this helps.

Soothsayer answered 2/5, 2020 at 3:27 Comment(2)
Does this mean that data from the changelog topic is shared with other nodes, so that if a broker goes down, another broker will recreate a new local state store from the shared changelog data? - Nason
Well, yes. Ideally, you would want to use the same replication factor for the changelog topic as for your other topics, so that the data is available on multiple brokers to provide fault tolerance. - Soothsayer
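For reference, a minimal sketch of how the replication factor for internal topics (changelog and repartition topics) might be set in the Streams configuration. `replication.factor` and `num.standby.replicas` are Kafka Streams config keys. The class name, the application id, and the broker address below are placeholders assumed for illustration, and the values 3 and 1 are only example choices.

```java
import java.util.Properties;

// Builds Streams configuration properties using plain string keys, so no Kafka dependency is needed here.
class StreamsReplicationConfigSketch {

    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("application.id", "word-count-app");    // hypothetical application id
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        // Replication factor applied to internal topics that Kafka Streams creates,
        // including state store changelog topics.
        props.setProperty("replication.factor", "3");
        // Optional: keep a warm copy of each state store on another instance to shorten failover.
        props.setProperty("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsProps();
        System.out.println(props.getProperty("replication.factor")); // prints 3
    }
}
```

These properties would then be passed to the `KafkaStreams` constructor together with the topology.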