How is ETCD a highly available system, even though it uses Raft which is a CP algorithm?

Asked 14/12, 2021 at 7:19 Answered 15/5, 2023 at 13:1

Solved kubernetes distributed-system etcd raft cap-theorem

The Kubernetes documentation describes etcd as:

Consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.

Does Kubernetes have a separate internal mechanism to make ETCD more available? Or does ETCD use, let's say, a modified version of Raft that allows this superpower?

Convulsant answered 14/12, 2021 at 7:19 Comment(2)

What do you mean by "CP algorithm?" from the title? Could you explain it? Raft is a consensus algorithm. – Witter 14/12, 2021 at 13:40

It was in the context of CAP theorem. I can explain it here but you'll get more clarity if you look it up yourself. – Convulsant 14/12, 2021 at 17:40

For details about etcd, it is best to rely on the official etcd documentation:

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.

There is no mention of high availability here. As for fault tolerance, the etcd documentation has a very good paragraph on the topic:

An etcd cluster operates so long as a member quorum can be established. If quorum is lost through transient network failures (e.g., partitions), etcd automatically and safely resumes once the network recovers and restores quorum; Raft enforces cluster consistency. For power loss, etcd persists the Raft log to disk; etcd replays the log to the point of failure and resumes cluster participation. For permanent hardware failure, the node may be removed from the cluster through runtime reconfiguration.

It is recommended to have an odd number of members in a cluster. An odd-size cluster tolerates the same number of failures as an even-size cluster but with fewer nodes.
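The arithmetic behind that recommendation can be sketched in a few lines. This is only an illustration of the majority rule Raft relies on, not etcd code; the function names `quorum` and `failure_tolerance` are made up for this example.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


# Print quorum size and tolerated failures for cluster sizes 1 through 7.
for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={failure_tolerance(n)}")
```

A 3-member and a 4-member cluster both tolerate exactly one failure, and 5 and 6 members both tolerate two, which is why the extra even member buys nothing in terms of fault tolerance.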

There is also a very good article on understanding etcd:

Etcd is a strongly consistent system. It provides Linearizable reads and writes, and Serializable isolation for transactions. Expressed more specifically, in terms of the PACELC theorem, an extension of the ideas expressed in the CAP theorem, it is a CP/EC system. It optimizes for consistency over latency in normal situations and consistency over availability in the case of a partition.


Berkow answered 15/12, 2021 at 11:59 Comment(2)

So, ETCD is not highly available is what you're saying, and K8s documentation needs correction? – Convulsant 16/12, 2021 at 4:57

This is what it looks like. The k8s documentation does not look very precise. – Witter 17/12, 2021 at 8:5

I believe it's important to keep in mind that the CAP theorem proves the limits of a system. E.g. in the usual case where you may get network partitions (P), you can't have a fully available (A) and at the same time fully consistent (C) system.

In practice though, different systems make different tradeoffs between availability A and consistency C, so it's more like a spectrum than a binary choice.

According to its homepage, etcd is strongly-consistent, meaning it tries to ensure consistency even at the expense of availability (and therefore leans towards C on the C-A spectrum).

That doesn't mean etcd can't be "highly-available" in the traditional IT sense, but there are certainly other databases that are much more available in the CAP theorem sense (usually called eventually consistent databases). What do I mean by that?

Let's say you have a 3-node etcd cluster and 1 node gets partitioned from the other two. If you try to use this separate node to make some changes, it won't let you because it doesn't have (Raft) quorum - this means it's not available in the CAP sense. But it doesn't mean that you can't use the rest of the cluster as you would normally (so you get "high availability" in the traditional sense).

Of course, if you lose the two nodes in a fire or some such accident, then you can't use this cluster any longer, even though you still have 1 node (again, the system is not available in the CAP sense).
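To make the two scenarios concrete, here is a minimal sketch of the majority check. It is a toy model, not how etcd is implemented; `has_quorum` and `writable_sides` are hypothetical names, and the only assumption is that a group of nodes can commit writes only if it contains more than half of the configured members.

```python
from typing import List


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if `reachable` members form a strict majority of the cluster."""
    return reachable > cluster_size // 2


def writable_sides(cluster_size: int, partition_sizes: List[int]) -> List[bool]:
    """For each group of mutually reachable nodes, can it accept writes?"""
    return [has_quorum(size, cluster_size) for size in partition_sizes]


# 3-node cluster split 2 | 1: only the side with two nodes keeps working.
print(writable_sides(3, [2, 1]))  # [True, False]

# 3-node cluster where two nodes are destroyed: the survivor cannot serve writes.
print(writable_sides(3, [1]))     # [False]
```

Note that the check is against the configured cluster size, not the number of nodes still alive, which is why the single surviving node in the second case stays unavailable.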

So, in etcd's case, what people usually mean by "high-availability" is that it can keep working as long as you (e.g. the k8s control node) are in the network partition that holds a majority (more than half) of the etcd nodes. In the most simplistic sense it means "you can lose up to X nodes and the cluster will survive".

Erna answered 15/5, 2023 at 13:1 Comment(0)
