Cassandra has a limit of 2 billion cells per partition, but what's a partition?
Asked Answered
C

2

36

In Cassandra Wiki, it is said that there is a limit of 2 billion cells (rows x columns) per partition. But it is unclear to me what is a partition?

Do we have one partition per node per column family, which would mean that the max size of a column family would be 2 billion cells * number of nodes in the cluster.

Or will Cassandra create as much partitions as required to store all the data of a column family?

I am starting a new project so I will use Cassandra 2.0.

Ceciliacecilio answered 11/12, 2013 at 7:8 Comment(0)
L
68

With the advent of CQL3 the terminology has changed slightly from the old thrift terms.

Basically

Create Table foo (a int , b int, c int, d int, PRIMARY KEY ((a,b),c))

Will make a CQL3 table. The information in a and b is used to make the partition key, this describes which node the information will reside on. This is the 'partiton' talked about in the 2 billion cell limit.

Within that partition the information will be organized by c, known as the clustering key. Together a,b and c, define a unique value of d. In this case the number of cells in a partition would be c * d. So in this example for any given pair of a and b there can only be 2 billion combinations of c and d

So as you model your data you want to ensure that the primary key will vary so that your data will be randomly distributed across Cassandra. Then use clustering keys to ensure that your data is available in the way you want it.

Watch this video for more info on Datmodeling in cassandra The Datamodel is Dead, Long live the datamodel

Edit: One more example from the comments

Create Table foo (a int , b int, c int, d int, e int, f int, PRIMARY KEY ((a,b),c,d))

Partitions will be uniquely identified by a combination of a and b.

Within a partition c and d will be used to order cells within the partition so the layout will look a little like:

(a1,b1) --> [c1,d1 : e1], [c1,d1  :f1], [c1,d2 : e2] ....  

So in this example you can have 2 Billion cells with each cell containing:

  • A value of c
  • A value of d
  • A value of either e or f

So the 2 billion limit refers to the sum of unique tuples of (c,d,e) and (c,d,f).

Love answered 11/12, 2013 at 16:51 Comment(9)
But for each pair of (a,b), c will define d so in your equation c* d, d=1, right? So there are 2 billion Cs for each pair of a and b?Kreplach
Exactly @user1944408, I guess I should have made my example a little more complicated.Love
@BenoitThiery 2 billion Cs or 1 billion Cs? As per your answer c * d must be 2Billion. So for 1 billion Cs there will be equivalent of 1 billion Ds. Whether my understanding is correct? Create Table foo (a int , b int, c int, d int, e int, f int, PRIMARY KEY ((a,b),c,d)) can u please explain the same for the above schema so that I can get clear pictureMisdemeanor
@RussS: [c1,d1 : e1], [c1,d1 :f1]. I don't think these can co-exist given for this two the PK combination of a1, b1, c1 and d1 is the same. So it should ideally override each other and not exist as a separate tuple.Thrall
@Thrall Those are two different CQL Columns The equivalent to the row (a,b,c,d,e,f) -> (1,1,1,1,1,1)Love
if we do not have c, d as part of primary key how does this 2B thing changes ?Exonerate
@lahiru Not sure what you are asking, if there are no clustering column then the partition is essentially static in length with the number of data cols.Love
@Love thanks for the response, if I do not add c, d as part of primary key and when we add more values to c, d with same (a, b) still the new columns will be added to same row ? only difference is if we add same (c, d) values with same (a, b) it will overwrite if we had c, d in our primary key and not create a new column? Am I correct here ?Exonerate
If C and D are not part of the primary key you will overwrite the old value. A clustering key is need to make the partition "wide"Love
H
4

From : http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html


Using a composite partition key¶

A composite partition key is a partition key consisting of multiple columns. You use an extra set of parentheses to enclose columns that make up the composite partition key. The columns within the primary key definition but outside the nested parentheses are clustering columns. These columns form logical sets inside a partition to facilitate retrieval. 


CREATE TABLE Cats (
  block_id uuid,
  breed text,
  color text,
  short_hair boolean,
  PRIMARY KEY ((block_id, breed), color, short_hair)
);

For example, the composite partition key consists of block_id and breed. The clustering columns, color and short_hair, determine the clustering order of the data. Generally, Cassandra will store columns having the same block_id but a different breed on different nodes, and columns having the same block_id and breed on the same node.


Implication

==> Partition is the smallest unit of replication (which on its own makes sh** no sense. :) )

==> Every combination of block_id and breed is a Partition.

==> On any given machine in cluster, either all or none of the rows with same partition-key will exist.

Honduras answered 17/9, 2014 at 1:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.