Cassandra num_tokens - is this really num_token_partitions?

Asked 15/11, 2013 at 7:3 Answered 27/12, 2018 at 6:41

I am new to Cassandra. I am reading about the num_tokens parameter for virtual nodes in the cassandra.yaml file. I don't think I quite understand what it does or how tokens/partitions are assigned. What is really going on here?

The default value of 256 does not make any sense if we are really talking about number of tokens/node. Is num_tokens really num_token_partitions/node?

Let us pick 2 nodes A and B to begin with, add a 3rd node C, and then try explaining how things work. To begin, each node is configured with num_tokens of 256. Now, when A and B come up:

1) How many tokens do A and B get when they join the cluster? What partition ranges do A and B get and how is that decided?
2) What kind of metadata is stored in Cassandra to know which partition ranges A and B carry?
3) What happens when C joins now? How does Cassandra decide what partition ranges C gets? How many partitions should be put on C?
4) How is the partition range for A and B decided when C joins?

Anybody kind enough to clarify in detail for the benefit of everyone?

Burglarious answered 15/11, 2013 at 7:3 Comment(0)

4) Partition ranges are determined by giving each node the range from each of its tokens up to the next token on the ring.

2) Gossip exchanges data detailing which nodes have which tokens. This metadata lets every node know which nodes are responsible for which ranges. Keyspace/replication settings also change where data is actually saved.

EXAMPLE: 1) A gets 256 ranges and B gets 256 ranges. To keep this simple, let's give each of them 2 tokens and pretend the token range is 0 to 30.

Given tokens A: 10, 15 and B: 3, 11, the nodes are responsible for the following ranges:

(3-9:B)(10:A)(11-14:B)(15-30,0-2:A)

3) If C joins, also with 2 tokens (20 and 5), the nodes are now responsible for the following ranges:

(3-4:B)(5-9:C)(10:A)(11-14:B)(15-19:A)(20-30,0-2:C)
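The two range listings above can be reproduced with a few lines of Python. This is only a sketch of the toy model in this answer (a 0 to 30 ring where a token owns every position from itself up to the next token), not Cassandra's actual code; the names `owner_of`, `ownership` and `RING_SIZE` are made up for illustration.

```python
from bisect import bisect_right

RING_SIZE = 31  # toy token space 0..30, as in the example above


def owner_of(position, tokens):
    """Return the node owning `position` under the toy rule used above:
    a token owns everything from itself up to (not including) the next token."""
    ordered = sorted(tokens)  # list of (token, node) pairs
    starts = [t for t, _ in ordered]
    # Index of the last token <= position; -1 wraps around to the highest token.
    idx = bisect_right(starts, position) - 1
    return ordered[idx][1]


def ownership(tokens):
    """Map each node to the sorted list of ring positions it owns."""
    owned = {}
    for pos in range(RING_SIZE):
        owned.setdefault(owner_of(pos, tokens), []).append(pos)
    return owned


before = [(10, "A"), (15, "A"), (3, "B"), (11, "B")]
after = before + [(20, "C"), (5, "C")]

for label, tokens in (("A+B", before), ("A+B+C", after)):
    print(label, {node: len(pos) for node, pos in sorted(ownership(tokens).items())})
# A+B {'A': 20, 'B': 11}
# A+B+C {'A': 6, 'B': 6, 'C': 19}

# Which previous owner did each of C's positions come from?
moved = {}
for pos in ownership(after)["C"]:
    moved.setdefault(owner_of(pos, before), []).append(pos)
print("C took over:", moved)
```

The last print shows C taking positions from both A and B, which is the point of giving every node several tokens.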

Vnodes are powerful because when C joins the cluster it gets its data from multiple nodes (5-9 from B and 20-30,0-2 from A), sharing the load between those machines. In this toy example you can see that having only 2 tokens lets some nodes host the majority of the data while others get almost none. As the number of vnodes increases, the balance between the nodes improves because the ranges become randomly subdivided more and more finely. At 256 vnodes per node you are extremely likely to have distributed an even amount of data to each node in the cluster.
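To get a feel for how balance changes with the number of tokens per node, here is a rough simulation. It assumes tokens are picked uniformly at random and uses the same toy ownership rule as the example (a token owns the range up to the next token); the function name `ownership_shares`, the ring size, the node count and the seed are all arbitrary choices for illustration, and a single random draw is not a proof.

```python
import random


def ownership_shares(num_nodes, num_tokens, ring_size=2**32, seed=0):
    """Fraction of the ring owned by each node when every node picks
    `num_tokens` random tokens (toy rule: a token owns up to the next token)."""
    rng = random.Random(seed)
    picks = rng.sample(range(ring_size), num_nodes * num_tokens)  # distinct tokens
    # Consecutive blocks of `num_tokens` picks belong to the same node.
    tokens = sorted((tok, i // num_tokens) for i, tok in enumerate(picks))
    owned = [0] * num_nodes
    for (tok, node), (nxt, _) in zip(tokens, tokens[1:] + tokens[:1]):
        owned[node] += (nxt - tok) % ring_size  # modulo handles the wrap-around
    return [o / ring_size for o in owned]


for num_tokens in (1, 2, 16, 256):
    shares = ownership_shares(num_nodes=6, num_tokens=num_tokens)
    print(f"num_tokens={num_tokens:>3}  min={min(shares):.3f}  max={max(shares):.3f}")
```

With 6 nodes a perfectly even split would be about 0.167 per node, so the gap between the printed min and max gives a quick impression of the imbalance for each setting.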

For more information on vnodes: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

Cholecyst answered 15/11, 2013 at 15:31 Comment(2)

Why are the ranges not uniformly distributed during the startup? For example, something like that: (0-7:A)(8-15:B)(16-23:A)(24-30:B). Thank you! – Constriction 12/11, 2014 at 11:45

It may be best to give a larger answer in a new question. The basic reason is that a uniform distribution of vnodes requires knowing how many nodes are going to be in the cluster. You would also bring back many of the problems of having a single token, i.e. it is difficult to increase capacity without doubling the number of nodes, because single-node additions cause a major rebalancing. – Cholecyst 12/11, 2014 at 16:49

Although RussS's answer is correct, I think it's difficult to follow.

The idea is not so much the token allocation, because that is just the technical means Cassandra uses to distribute access to the data.

To understand why this is meaningful, what matters is the replication factor and the ring.

With a replication factor of 3, replication works by copying the data of a node to the next two nodes on the ring. So if you're on node A, the data assigned to A is replicated on B and C. The data assigned to B is replicated on C and D, and so on.
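The "next two nodes" rule can be written down in a few lines. This sketch assumes the simple clockwise placement described in the paragraph above and a replication factor of 3; the `replicas` helper and the five-node ring are hypothetical, and real keyspace replication settings can place data differently.

```python
def replicas(ring, owner, replication_factor=3):
    """Nodes holding the data assigned to `owner`: the owner itself plus the
    next replication_factor - 1 nodes clockwise on the ring."""
    start = ring.index(owner)
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]


ring = ["A", "B", "C", "D", "E"]  # hypothetical ring order
for node in ring:
    print(node, "->", replicas(ring, node))
# A -> ['A', 'B', 'C']
# B -> ['B', 'C', 'D']
# C -> ['C', 'D', 'E']
# D -> ['D', 'E', 'A']
# E -> ['E', 'A', 'B']
```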

If you have just 3 nodes and a replication factor of 3, it does not make any difference: every node holds all the data.

If you have 100 nodes, a replication factor of 3 and num_tokens: 1, then each node's data is held by exactly 3 nodes, and that is always the node's entire data set. In our example above, that means all the data A is assigned can be read from A, B, or C and only those three nodes. So if you are trying to load that specific data often and the rest not so often, your cluster is going to be rather unbalanced.

With vnodes, the data is broken up into sub-partitions. One computer represents many virtual nodes. So the old computer A may now represent A, D, G, J, M, assuming num_tokens: 5.

Next we have the ring. When building the ring, the computers connect to each other in such a way that a computer never connects to itself (A won't talk to D directly and vice versa).

Now, this means that one physical computer is going to be connected to num_tokens × (replication_factor - 1) other computers. So with num_tokens set to 5 and a replication factor of 3, you are going to be connected to 5 × 2 = 10 other computers. This means the load is going to be shared between 10 computers instead of 3 (as the replication factor would otherwise imply).
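The figure of 10 in the paragraph above corresponds to num_tokens × (replication_factor - 1). As a quick check of that arithmetic, here is a small helper; the name `max_replica_peers` and the optional `num_nodes` cap are my own additions (a machine cannot have more distinct peers than there are other machines in the cluster), not something taken from Cassandra.

```python
def max_replica_peers(num_tokens, replication_factor, num_nodes=None):
    """Upper bound on the other machines sharing replicas with one machine,
    following the formula above: num_tokens * (replication_factor - 1).
    If num_nodes is given, cap the result at num_nodes - 1 (assumption)."""
    peers = num_tokens * (replication_factor - 1)
    if num_nodes is not None:
        peers = min(peers, num_nodes - 1)
    return peers


print(max_replica_peers(5, 3))                  # 10
print(max_replica_peers(256, 3))                # 512
print(max_replica_peers(256, 3, num_nodes=16))  # 15
```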

So with 16 nodes, num_tokens: 256 and a replication factor of 3, it would be a strange setup, since it would imply that all the nodes are connected to each other 512 times. That being said, changing num_tokens later can take the cluster some time to adjust to the new value, especially on a large installation. So if you foresee having a large number of nodes, a rather large num_tokens is a good idea from the start.

As a side effect, it will also distribute the data between various tables (files) on each node. That can also help find data faster. It is actually suggested that you use a larger number of instances (16 to 64) whenever you create an Elassandra cluster to ease the search.

Patroclus answered 27/12, 2018 at 6:41 Comment(0)

"At 256 nodes you are extremely likely to have distributed an even amount of data to each node in the cluster."

Unless of course it's not. Random vnode token range allocation has nothing to do with balanced load. A balanced load comes from token ranges ENGINEERED to be balanced, not guessed.

Then there are the bugs in token range allocation, CASSANDRA-6388 and CASSANDRA-7032, neither of which is fixed in any cluster running in production today. Then there are the major problems with 256-vnode clusters: trying to rebuild them or back them up is impossible, literally.

Rebuilds and recoveries take WEEKS. And just try running Hadoop against vnodes in production. Give up an engineered token range cluster for vnode Hail Marys at your peril.

Rugged answered 22/10, 2015 at 22:35 Comment(0)
