Understanding Cassandra's storage overhead
I have been reading this section of the Cassandra docs and found the following a little puzzling:

Determine column overhead:

regular_total_column_size = column_name_size + column_value_size + 15

counter_expiring_total_column_size = column_name_size + column_value_size + 23

Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
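Taken at face value (and assuming one byte per character for ASCII column names), the first formula would give, for example, for a timestamp column named report_date (11-character name, 8-byte timestamp value):

regular_total_column_size = 11 + 8 + 15 = 34 bytes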

The way I interpret the above for a CQL3 defined schema such as:

CREATE TABLE mykeyspace.mytable(
  id text,
  report_id text,
  subset_id text,
  report_date timestamp,
  start_date timestamp,
  end_date timestamp,
  subset_descr text,
  x int,
  y double,
  z int,
  PRIMARY KEY (id, report_id, subset_id)
);

is that each row will contain the metadata for the column names, e.g., the strings report_date, start_date, end_date, etc., along with the data. However, it's not clear to me what it means that each row in a table can have different column names. This sounds wrong to me given that the schema above is completely static; indeed, Cassandra 2.0 will most certainly complain if I try to write:

INSERT INTO mykeyspace.mytable (id, report_id , subset_id, x, y, z, w) 
VALUES ( 'asd','qwe','rty',100,1.234,12, 123.123);

Bad Request: Unknown identifier w

Now it looks to me like the column names are fixed given this table schema, and thus the metadata should not need to be stored for each row. I am guessing that either the phrasing in the documentation is outdated (it's the same as in the Cassandra 1.2 docs) or I'm misunderstanding some core concept at work here.

Can anybody clarify? Bottom line: do I have to worry about the length of the names of my columns or not?

We have been playing it safe and used single-character names where possible (so the above columns would actually be i, r, s, dr, ds, de, sd, ...), but the result is barely human-readable and can be confusing to work with.
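As a back-of-the-envelope figure (assuming the per-column accounting quoted above applies to the seven non-key columns): the descriptive names total 44 bytes per row (report_date 11, start_date 10, end_date 8, subset_descr 12, plus x, y, z at 1 each) versus 11 bytes for the abbreviated ones, so roughly 33 bytes per row are at stake, before compression.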

Pavis answered 30/5, 2014 at 15:44 Comment(3)
"Every column in Cassandra incurs 15 bytes of overhead." - this is meant by row, am I right? So in every row, there is 15 bytes overhead per column, right?Sociology
@Sociology yes, that part is clear.Pavis
I would just like to add, that now, since version 3.x of Cassandra, we no longer get penalized for using long column names! However it might still be a problem for Scylla or others who are still on older versions of the SSTable file format.Posner
The easiest way to figure out what is going on in situations like this is to check the sstable2json (cassandra/bin) representation of your data. This will show you what actually ends up being saved on disk.

Here is the example for your situation

 [
 {"key": "4b6579","columns": [
       ["rid1:ssid1:","",1401469033325000],
       ["rid1:ssid1:end_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:report_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:start_date","2004-10-03 00:00:00-0700",1401469033325000], 
       ["rid1:ssid1:subset_descr","descr",1401469033325000],
       ["rid1:ssid1:x","1",1401469033325000], 
       ["rid1:ssid1:y","5.5",1401469033325000],
       ["rid1:ssid1:z","1",1401469033325000],
       ["rid2:ssid2:","",1401469938599000],
       ["rid2:ssid2:end_date", "2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:report_date","2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:start_date","2004-10-03 00:00:00-0700",1401469938599000], 
       ["rid2:ssid2:subset_descr","descr",1401469938599000],
       ["rid2:ssid2:x","1",1401469938599000],
       ["rid2:ssid2:y","5.5",1401469938599000],
       ["rid2:ssid2:z","1",1401469938599000]
 ]}
 ]

The value of the partition key is saved once per partition (per sstable), as you can see above; its column name doesn't matter at all here, since it is implicit given the table. The column names for the clustering columns are also not present, because with C* you aren't allowed to insert without specifying all portions of the key.

What's left, though, does carry the column name. This is needed in case a partial update to a row is made, so that the update can be saved without the rest of the row information. You could imagine an update to a single column field in a row; to indicate which field it is, C* currently uses the column name, but there are tickets open to change this to a smaller representation: https://issues.apache.org/jira/browse/CASSANDRA-4175
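For example, a partial update like the following (hypothetical values, reusing the rows from the reproduction steps below) writes a single cell, and that cell has to carry its own column name so it can later be merged with the rest of the row:

UPDATE mykeyspace.mytable SET x = 2 WHERE id = 'Key' AND report_id = 'rid1' AND subset_id = 'ssid1';

After another nodetool flush, running sstable2json on the newly flushed sstable should show just the single ["rid1:ssid1:x","2",<timestamp>] cell rather than a full row.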

To generate this

cqlsh
CREATE TABLE mykeyspace.mytable(   id text,   report_id text,   subset_id text,   report_date timestamp,   start_date timestamp,   end_date timestamp,   subset_descr text,   x int,   y double,   z int,   PRIMARY KEY (id, report_id, subset_id) );
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid1','ssid1', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid2','ssid2', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
exit;
nodetool flush
bin/sstable2json $DATA_DIR/mytable/mykeyspace-mytable-jb-1-Data.db 
Cinda answered 30/5, 2014 at 17:29 Comment(3)
On-disk compression will obviously reclaim most of the space lost to duplicated column names, so it's not a big deal for disk space, but for reading it's a bit more of an issue. – Cinda
What do you mean, for reading? And how can I find out the actual size on disk of the compressed data? – Pavis
When the data is actually read from disk it is decompressed, so the in-memory representation on the server will contain the full column names. Wasted memory. To find out the size on disk you can check nodetool status, or just count up the size of all the files in the data dir, minus snapshots. – Cinda
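To make that concrete (a sketch using the same $DATA_DIR convention as the steps above; whether cfstats accepts a keyspace argument varies by version):

nodetool cfstats mykeyspace          # per-table "Space used (live)" is the compressed on-disk size
du -sh $DATA_DIR/mytable             # raw size of the table's data files
du -sh $DATA_DIR/mytable/snapshots   # subtract this, if any snapshots exist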
