RussS's answer is correct, but I think it's difficult to follow.
The key idea is not so much the token allocation itself; that is just the technical means Cassandra uses to distribute access to the data.
To understand why this matters, what's important are the replication factor and the ring.
With a replication factor of 3, replication works by copying a node's data onto the next two nodes on the ring. So if you're on node A, the data assigned to A is replicated on B and C; the data assigned to B is replicated on C and D, and so on.
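To make the "copy onto the next two" rule concrete, here is a minimal Python sketch of that clockwise placement (the five node names and the single-token ring are made up for illustration; real Cassandra hashes partition keys onto tokens first):

```python
# Minimal sketch: a single-token ring with replication factor 3.
# Each node's data is copied onto the next RF - 1 nodes clockwise.
RING = ["A", "B", "C", "D", "E"]   # physical nodes, in token order
RF = 3                             # replication factor

def replicas(owner: str) -> list[str]:
    """Nodes holding a copy of the data assigned to `owner`."""
    i = RING.index(owner)
    return [RING[(i + k) % len(RING)] for k in range(RF)]

print(replicas("A"))   # ['A', 'B', 'C']
print(replicas("B"))   # ['B', 'C', 'D']
```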
If you have just 3 nodes and a replication of 3, it does not make any difference.
If you have 100 nodes, a replication factor of 3, and `num_tokens: 1`, then exactly 3 nodes hold each node's data, and that is always the entire data set of that node. In our example above, it means all the data assigned to A can be read from A, B, or C, and only those three nodes. So if you are trying to load that specific data often, and the rest not so often, your cluster is going to be rather unbalanced.
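The same placement rule shows why a big single-token cluster stays narrow for any one piece of data; this sketch (node names `n00`..`n99` are invented) simply scales the ring up to 100 nodes:

```python
# 100 single-token nodes, replication factor 3: the data assigned to
# "n00" still lives on exactly three machines, however large the ring is.
RING = [f"n{i:02d}" for i in range(100)]
RF = 3

i = RING.index("n00")
print([RING[(i + k) % len(RING)] for k in range(RF)])   # ['n00', 'n01', 'n02']
```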
With vnodes, the data is broken up into sub-partitions: one computer represents many virtual nodes. So the old computer A may now represent A, D, G, J, and M, assuming `num_tokens: 5`.
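A rough sketch of that mapping, assuming three physical machines (`pc1`..`pc3`, names invented) dealt out round-robin; a real cluster assigns vnode tokens randomly or via the token allocation algorithm:

```python
import string

NUM_TOKENS = 5                       # vnodes per physical machine
MACHINES = ["pc1", "pc2", "pc3"]

# Label the 15 ring positions A..O and deal them out round-robin,
# so pc1 ends up owning positions A, D, G, J and M as in the example above.
positions = string.ascii_uppercase[:NUM_TOKENS * len(MACHINES)]
owner = {pos: MACHINES[i % len(MACHINES)] for i, pos in enumerate(positions)}

print([p for p, m in owner.items() if m == "pc1"])   # ['A', 'D', 'G', 'J', 'M']
```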
Next we have the ring. When building the ring, the computers connect to each other in such a way that a computer never replicates to itself (A won't talk to D directly, and vice versa, since both are vnodes of the same physical machine).
Now, it means that one physical computer is going to be connected to up to `num_tokens × (replication_factor - 1)` other computers. So with `num_tokens` set to 5 and a replication factor of 3, you are going to be connected to 10 other computers. This means the load is shared between 10 computers instead of 3 (as the replication factor alone would imply).
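As a sanity check on that arithmetic, the sketch below builds a shuffled vnode ring (16 invented machines, `num_tokens: 5`, replication factor 3, SimpleStrategy-style clockwise placement; rack- and DC-aware strategies behave differently) and counts how many distinct machines end up holding copies of one machine's data. The result is at most `num_tokens × (replication_factor - 1)`, i.e. 10 here:

```python
import random

NUM_TOKENS, RF, N_MACHINES = 5, 3, 16
random.seed(1)                     # fixed seed so the run is repeatable

# A shuffled vnode ring: every machine appears NUM_TOKENS times.
machines = [f"pc{i}" for i in range(N_MACHINES)]
ring = machines * NUM_TOKENS
random.shuffle(ring)

def next_distinct(start: int, count: int) -> list[str]:
    """Walk clockwise from `start` and collect the next `count` distinct other machines."""
    found, i = [], start + 1
    while len(found) < count:
        m = ring[i % len(ring)]
        if m != ring[start] and m not in found:
            found.append(m)
        i += 1
    return found

# Every machine that holds a replica of pc0's data, across pc0's 5 vnode ranges.
peers = {m
         for i, owner in enumerate(ring) if owner == "pc0"
         for m in next_distinct(i, RF - 1)}

print(len(peers), "distinct peers; upper bound =", NUM_TOKENS * (RF - 1))
```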
So with 16 nodes, `num_tokens: 256`, and a replication factor of 3, it would be a strange setup, since it implies the nodes are connected to each other 512 times over. That being said, changing `num_tokens` later takes the cluster a while to adjust to the new value, especially on a large installation. So if you foresee having a large number of nodes, a rather large `num_tokens` is a good idea from the start.
As a side effect, it will also distribute the data between various tables (files) on each node, which can help find data faster. It is actually suggested that you use a larger number of instances (16 to 64) whenever you create an Elassandra cluster, to ease the search.