What is the difference between LEO and HW in a replica (leader replica)?
Will they contain the same number? I understand that HW is the last committed message offset.
When is LEO updated, and how?
The high watermark (HW) marks the offset up to which messages are fully replicated, while the log end offset (LEO) can be larger when records have been appended to the leader's log but not yet replicated.
Consumers can only consume messages up to the high watermark.
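To make the relationship concrete, here is a minimal sketch of a single log, not Kafka's actual code. It assumes both values are counted as "the next offset", so LEO is the length of the log and HW is the number of committed records; the names `log`, `high_watermark` and `visible_to_consumers` are made up for illustration.

```python
# Toy model of a leader's log (illustrative only, not Kafka internals).
log = ["m0", "m1", "m2", "m3"]   # records appended on the leader
high_watermark = 3               # assumed: m0..m2 are fully replicated

# Assumption: LEO is the offset the next appended record would get.
log_end_offset = len(log)


def visible_to_consumers(records, hw):
    """Return only the records below the high watermark."""
    return records[:hw]


readable = visible_to_consumers(log, high_watermark)
uncommitted = log[high_watermark:log_end_offset]
```

Under these assumptions `readable` holds m0 to m2 and `uncommitted` holds m3, the record that exists on the leader but is not yet visible to consumers. HW and LEO hold the same number only when nothing is waiting to be replicated.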
See this blog post for more details: http://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
Let's start with one of the most popular definitions of the high watermark that can be found on Google:
The high watermark is the offset of the last message that was successfully copied to all of the log’s replicas
I was not convinced by this definition, and while digging deeper I found this nice picture:
So what was wrong? The stuck follower at the far right of the picture did not have the fourth message in its log. Maybe the first definition was incomplete and what the author actually meant was: "The high watermark is the offset of the last message that was successfully copied to all of the log’s in-sync replicas"
Guided by this intuition, I found this article, which explains how the high watermark is computed and includes the code.
I found the definition given there much more precise:
High watermark is calculated as the minimum LEO across all the ISR of this partition, and it grows monotonically.
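The quoted rule can be sketched in a few lines. This is only an illustration of "minimum LEO across the ISR, growing monotonically", not the broker's actual code; the function name `advance_high_watermark` and the replica names are hypothetical.

```python
def advance_high_watermark(current_hw, leo_by_replica, isr):
    """Return min LEO over the in-sync replicas, never moving backwards."""
    candidate = min(leo_by_replica[replica] for replica in isr)
    return max(current_hw, candidate)


# Hypothetical state: the leader and one follower have 4 records,
# a stuck follower has only 3.
leo_by_replica = {"leader": 4, "follower-1": 4, "follower-2": 3}

# While the stuck follower is still in the ISR it holds the HW back.
hw_all_in_sync = advance_high_watermark(
    0, leo_by_replica, {"leader", "follower-1", "follower-2"}
)

# Once it is dropped from the ISR, the HW can advance to the leader's LEO.
hw_after_shrink = advance_high_watermark(
    hw_all_in_sync, leo_by_replica, {"leader", "follower-1"}
)
```

With these numbers `hw_all_in_sync` is 3 and `hw_after_shrink` is 4. That matches the picture described above: a follower outside the ISR can be missing a message that is already below the high watermark.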
This definition, together with the code provided, confirmed my intuition.
To sum up, I think the detailed definition of the high watermark shows the difference between LEO and HW. On an in-sync follower the LEO may coincide with the high watermark, but on the leader it may well be ahead of it, as the example in the first linked image shows.
Your question has already been answered by others in general terms. I'd like to touch on a few other things that matter for these two values.
They are not hard to reason about if you assume that all reads and writes go through the leader replica.
Things get slightly more complicated once you know that this is not always the case: this KIP says that a consumer can be set up to consume messages not only from the leader, but from a follower replica too.
As such, there are a few things that matter here:
HW, at least, is tracked in each individual replica. Otherwise (with that KIP in mind) you could consume messages from one replica that are ahead of the other replicas and have not yet been replicated.
Another reason HW is tracked on each replica shows up when one replica goes down and another is elected leader. What should the value of LEO be on the new leader? The latest HW is a natural fit.
One more interesting case is a consumer connected to a replica that counts toward the replication factor but is not one of the min.insync.replicas replicas that have to acknowledge a write. When you produce a message, it is replicated synchronously to the min.insync.replicas replicas and asynchronously to the remaining ones. (Think of A, B as the minimum in-sync replicas and A, B, C as the replicas for replication factor 3.) In this case the producer can successfully write to A and B and be told that the message has been written, while a consumer attached to C can see it only later, once C has fetched the message and its HW is updated.
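The A, B, C scenario can be walked through with a toy model. This is a simplified sketch, not how the broker is implemented. It assumes that a follower copies records from the leader on each fetch and sets its own HW to the smaller of the leader's HW and its own LEO, and that the leader sees follower progress immediately. The class `Replica` and the helper functions are hypothetical names.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []   # records stored on this replica
        self.hw = 0     # high watermark as known to this replica

    @property
    def leo(self):
        # Assumption: LEO is the offset the next record would get.
        return len(self.log)


def follower_fetch(leader, follower):
    """Copy missing records, then take over the leader's HW (capped by own LEO)."""
    follower.log.extend(leader.log[follower.leo:])
    follower.hw = min(leader.hw, follower.leo)


def leader_advance_hw(leader, isr):
    """Leader HW is the minimum LEO across the ISR and never decreases."""
    leader.hw = max(leader.hw, min(replica.leo for replica in isr))


def readable(replica):
    """Records a consumer attached to this replica may see."""
    return replica.log[:replica.hw]


a, b, c = Replica("A"), Replica("B"), Replica("C")

a.log.append("m0")            # producer writes to leader A
follower_fetch(a, b)          # B replicates the record
leader_advance_hw(a, [a, b])  # A and B have it, so the write counts as committed

visible_on_c_before = readable(c)   # C has not fetched yet
follower_fetch(a, c)                # C catches up and learns the new HW
visible_on_c_after = readable(c)
```

In this model the write is committed on A as soon as B has the record, yet a consumer reading from C sees nothing until C's own fetch brings over both the record and the updated HW. B also keeps an HW of 0 until its next fetch, which is why each replica tracks its own value.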