What are the implications of R + W > N for Cassandra clusters?

Asked 19/10, 2011 at 6:44 Answered 15/12, 2019 at 18:3

Solved replication cassandra cluster-computing

This introduction to Cassandra Replication and Consistency (slides 14-15) boldly asserts:

R+W>N guarantees overlap of read and write quorums.

Please imagine this inequality has huge fangs, dripping with the blood of innocent, enterprise developers so you can best appreciate the terror it inspires.

I understand having the sum of the read and write Consistency Levels (R+W) greater than Replication Factor (N) is a good idea... but what's the big deal?

What are the implications, and how does R+W>N compare to the alternatives?

R+W < N
R+W = N
R+W >> N

Glutamine answered 19/10, 2011 at 6:44 Comment(0)

The basic problem we are trying to solve is this:

Can a situation occur in which a read doesn't return the most up-to-date value?

Obviously, this is best avoided if possible!

If R+W <= N, then this situation can occur.

A write could send a new value to one group of nodes, while a subsequent read could read from a completely separate group of nodes, and thus miss the new value written.

If R+W > N, then this situation is guaranteed not to occur.

There are N nodes that might hold the value. A write contacts at least W nodes - place a "write" sticker on each of these. A subsequent read contacts at least R nodes - place a "read" sticker on each of these. There are R+W stickers but only N nodes, so at least one node must have both stickers. That is, at least one node participates in both the read and the write, so is able to return the latest write to the read operation.

R+W >> N is impossible.

The maximum number of nodes that you can read from, or write to, is N (the replication factor, by definition). So the most we can have is R = N and W = N, i.e. R+W = 2N. This corresponds to reading and writing at ConsistencyLevel ALL. That is, you just write to all the nodes and read from all the nodes, nothing fancy happens.

Rancell answered 19/10, 2011 at 14:49 Comment(3)

Good explanation - just wanted to add that that Read Repair can update the nodes, in the background, so that the next read will be up-to-date. – Macedo 23/10, 2011 at 14:4

@Macedo just note that read repair is not done in all cases. It's controlled by read_repair_chance which is set to 0.1 by default. So you have 10% chance, that the next read will be up-to-date. – Marius 21/10, 2017 at 16:33

good answer . if R+W=2N, It's the same as one single node, doesn't add any fault tolerance. – Anastice 18/4, 2020 at 6:32

Quorum write and Quorum read allow to detect stale values in a leaderless replication system.

For example, we have 3 replicators A, B, C (N=3). C is down during a user update. The update is accepted on both A and B (Write = 2).

When the user reads the value, C comes back. It's possible to read a stale value in C. In order to detect the stale value, the user will also read from B (Read = 2).

When the user received updates from B and C, a version number can be used to determine which value is newer(B has a newer version number).

In this scenario, where Write = 2, Read = 2, N = 3, R + W > 3, we are certain that any stale value can be detected.

For R + W = 3, it's possible to have written in A and B, but only read from C. In this case, we can't detect the stale value.

Buttress answered 20/10, 2017 at 15:55 Comment(1)

+1 for pointing out that a version number needs to be attached to the value in order for the latest value to be recognized. – Colt 5/2, 2021 at 3:27

Cassandra uses Leaderless replication. This means there is no single node which is the authority to provide the most recent or correct value. So, we will have to read the value (for a key) using more democratic means i.e. ask multiple nodes and then derive the correct value.

Let's understand it through examples:

Assume for all examples that there are 3 replicas, i.e. N =3. And 3 nodes are A, B, C

R = 1, W = 1, N =3

It basically means we are storing 3 copies of the same data but we have configured that consider read and writes to be successful even if one node responds.

Now, let's take case of updating the value of x to 5 from the current value of 3. During write, assume that write was only successful on node A due to some reason (W value is 1) so it will be considered as successful write.

Now during the read, we can get below values: if node A is reachable; the client reads the value of 5. (i.e. gets correct values) if node A is unreachable/down. The client gets the stale value of 3. So clearly, this configuration (R+W < N) will not provide consistent read.

R = 1, W = 2, N =3

Here, though the write is happening to two nodes but still read will be confirmed only from 1 node. Read can still happen from a node which does't have the latest value. So clearly, this configuration (R+W = N) will not provide consistent read.

R = 2, W = 2, N =3

Best case (read and write from the same set of nodes): write to A, B and Read: A, B => Consistent read i.e. latest value.
Worst case (one node is common): write to A,B and read: B,C => Consistent read since the there is an overlap of node B.

So only R+W > N guarantees the consistent read.

You can explore more options here.

Osteoma answered 15/12, 2019 at 18:3 Comment(0)

Recommended topics

Hot tags