What are the differences between a node, a cluster and a datacenter in a cassandra nosql database?
Asked Answered
N

4

47

I am trying to duplicate data in a cassandra nosql database for a school project using datastax ops center. From what I have read, there is three keywords: cluster, node, and datacenter, and from what I have understand, the data in a node can be duplicated in another node, that exists in another cluster. And all the nodes that contains the same (duplicated) data compose a datacenter. Is that right?

If it is not, what is the difference?

Nahuatlan answered 28/1, 2015 at 15:45 Comment(0)
B
90

The hierarchy of elements in Cassandra is:

  • Cluster
    • Data center(s)
      • Rack(s)
        • Server(s)
          • Node (more accurately, a vnode)

A Cluster is a collection of Data Centers.

A Data Center is a collection of Racks.

A Rack is a collection of Servers.

A Server contains 256 virtual nodes (or vnodes) by default.

A vnode is the data storage layer within a server.

Note: A server is the Cassandra software. A server is installed on a machine, where a machine is either a physical server, an EC2 instance, or similar.

Now to specifically address your questions.

An individual unit of data is called a partition. And yes, partitions are replicated across multiple nodes. Each copy of the partition is called a replica.

In a multi-data center cluster, the replication is per data center. For example, if you have a data center in San Francisco named dc-sf and another in New York named dc-ny then you can control the number of replicas per data center.

As an example, you could set dc-sf to have 3 replicas and dc-ny to have 2 replicas.

Those numbers are called the replication factor. You would specifically say dc-sf has a replication factor of 3, and dc-ny has a replication factor of 2. In simple terms, dc-sf would have 3 copies of the data spread across three vnodes, while dc-sf would have 2 copies of the data spread across two vnodes.

While each server has 256 vnodes by default, Cassandra is smart enough to pick vnodes that exist on different physical servers.

To summarize:

  • Data is replicated across multiple virtual nodes (each server contains 256 vnodes by default)
  • Each copy of the data is called a replica
  • The unit of data is called a partition
  • Replication is controlled per data center
Brynnbrynna answered 12/2, 2015 at 3:30 Comment(5)
According to link - One (Token) Ring to Rule Them All one cluster has one ring. So complete token ring may actually exist in a cluster instead of a data center.Strobe
@Strobe that link no longer works - do you have a cached copy somewhere or could you summarize what it said? I think it might be relevant to an issue i'm running into right now. Thank youNagari
Assume you have 6 servers w/ 1 node per server in DC1, DC2. The node tokens are 1 (node1), 2 (node2), 3 (node3) in DC1 and 1 (node4), 2 (node5) and 3 (node6) in DC2. A partition's token is created via a hash. The partition token is matched to a node token to find the primary replica. If a partition has a token of 1, then we know its primary replica in DC1 = node 1 and DC2 = node 4. Example from Apigee: community.apigee.com/articles/13096/…Brynnbrynna
For anyone who's stuck on these concepts, use nodetool ring to view the node tokens in your cluster. It'll make the concepts clear for you especially if you're confused by conflicting information on the internet.Brynnbrynna
Can a node or server be a member of multiple datacenters?Vicegerent
C
29

A node is a single machine that runs Cassandra. A collection of nodes holding similar data are grouped in what is known as a "ring" or cluster.

Sometimes if you have a lot of data, or if you are serving data in different geographical areas, it makes sense to group the nodes of your cluster into different data centers. A good use case of this, is for an e-commerce website, which may have many frequent customers on the east coast and the west coast. That way your customers on the east coast connect to your east coast DC (for faster performance), but ultimately have access to the same dataset (both DCs are in the same cluster) as the west coast customers.

More information on this can be found here: About Apache Cassandra- How does Cassandra work?

And all the nodes that contains the same (duplicated) data compose a datacenter. Is that right?

Close, but not necessarily. The level of data duplication you have is determined by your replication factor, which is set on a per-keyspace basis. For instance, let's say that I have 3 nodes in my single DC, all storing 600GB of product data. My products keyspace definition might look like this:

CREATE KEYSPACE products
WITH replication = {'class': 'NetworkTopologyStrategy', 'MyDC': '3'};

This will ensure that my product data is replicated equally to all 3 nodes. The size of my total dataset is 600GB, duplicated on all 3 nodes.

But let's say that we're rolling-out a new, fairly large product line, and I estimate that we're going to have another 300GB of data coming, which may start pushing the max capacity of our hard drives. If we can't afford to upgrade all of our hard drives right now, I can alter the replication factor like this:

CREATE KEYSPACE products
WITH replication = {'class': 'NetworkTopologyStrategy', 'MyDC': '2'};

This will create 2 copies of all of our data, and store it in our current cluster of 3 nodes. The size of our dataset is now 900GB, but since there are only two copies of it (each node is essentially responsible for 2/3 of the data) our size on-disk is still 600GB. The drawback here, is that (assuming I read and write at a consistency level of ONE) I can only afford to suffer a loss of 1 node. Whereas with 3 nodes and a RF of 3 (again reading and writing at consistency ONE), I could lose 2 nodes and still serve requests.

Edit 20181128

When I make a network request am I making that against the server? or the node? Or I make a request against the server does it then route it and read from the node or something else?

So real quick explanation: server == node

As far as making a request against the nodes in your cluster, that behavior is actually dictated from the driver on the application side. In fact, the driver maintains a copy of the current network topology, as it reads the cluster gossip similar to how the nodes do.

On the application side, you can set a load balancing policy. Specifically, the TokenAwareLoadBalancingPolicy class will examine the partition key of each request, figure out which node(s) has the data, and send the request directly there.

For the other load balancing policies, or for queries where a single partition key cannot be determined, the request will be sent to a single node. This node will act as a "coordinator." This chosen node will handle the routing of requests to the nodes responsible for them, as well as the compilation/returning of any result sets.

Cornered answered 28/1, 2015 at 16:42 Comment(4)
I'm an iOS developer, trying to understand these. For me everything just communicates to the server ie it makes a network request and gets a response. When I make a network request am I making that against the server? or the node? Or I make a request against the server does it then route it and read from the node or something else? Is it possible that you add an image?Opacity
@Honey Edit made.Cornered
Do you mean ring as a data structure or what ?Sailmaker
@Sailmaker No. A "ring" is sometimes used to refer to a cluster or group of several nodes (machines).Cornered
J
12

Node:

A machine which stores some portion of your entire database. This may included data replicated from another node as well as it's own data. What data it is responsible for is determined by it's token ranges, and the replication strategy of the keyspace holding the data.

Datacenter:

A logical grouping of Nodes which can be separated from another nodes. A common use case is AWS-EAST vs AWS-WEST. The replication NetworkTopologyStrategy is used to specify how many replicas of the entire keyspace should exist in any given datacenter. This is how Cassandra users achieve cross-dc replication. In addition their are Consistency Level policies that only require acknowledgement only within the Datacenter of the coordinator (LOCAL_*)

Cluster

The sum total of all the machines in your database including all datacenters. There is no cross-cluster replication.

Jollity answered 28/1, 2015 at 16:41 Comment(6)
We answer within 30 seconds of each other, and both use the east/west coast data center example. What are the odds of that? LOL.Cornered
Ha, Cassandra SO feels like a small place some times :)Jollity
If a cluster is the sum total of all the machines does it mean that there is only one cluster then? What do people mean when they refer to multiple clusters?Overnice
Multiple clusters would be a multiple fully independent databases. They would not communicate.Jollity
I'm an iOS developer, trying to understand these. For me everything just communicates to the server ie it makes a network request and gets a response. When I make a network request am I making that against the server? or the node? Or I make a request against the server does it then route it and read from the node or something else? Is it possible that you add an image?Opacity
Nodes are Servers. When you (a driver) makes a request, you are sending that to an arbitrary node/server which will then forward it to the node/server which has the data requested.Jollity
A
0

As per below documents:- https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/architecture/archIntro.html

Node Where you store your data. It is the basic infrastructure component of Cassandra.

Datacenter A collection of related nodes. A datacenter can be a physical datacenter or virtual datacenter. Different workloads should use separate datacenters, either physical or virtual. Replication is set by datacenter. Using separate datacenters prevents Cassandra transactions from being impacted by other workloads and keeps requests close to each other for lower latency. Depending on the replication factor, data can be written to multiple datacenters. datacenters must never span physical locations.

Cluster A cluster contains one or more datacenters. It can span physical locations.

Anatollo answered 5/3, 2020 at 5:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.