Understanding Cassandra's storage overhead
I have been reading this section of the Cassandra docs and found the following a little puzzling:

Determine column overhead:

regular_total_column_size = column_name_size + column_value_size + 15

counter_expiring_total_column_size = column_name_size + column_value_size + 23

Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
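Taken at face value (and assuming one byte per character for ASCII column names), the first formula would give, for example, for a timestamp column named report_date (11-character name, 8-byte timestamp value):

regular_total_column_size = 11 + 8 + 15 = 34 bytes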

The way I interpret the above for a CQL3 defined schema such as:

CREATE TABLE mykeyspace.mytable(
  id text,
  report_id text,
  subset_id text,
  report_date timestamp,
  start_date timestamp,
  end_date timestamp,
  subset_descr text,
  x int,
  y double,
  z int,
  PRIMARY KEY (id, report_id, subset_id)
);

is that each row will contain the metadata for the column names, e.g., the strings report_date, start_date, end_date, etc., along with the data. However, it's not clear to me what it means that each row in a table can have different column names. This sounds wrong to me given that the schema above is completely static; indeed, Cassandra 2.0 will most certainly complain if I try to write:

INSERT INTO mykeyspace.mytable (id, report_id , subset_id, x, y, z, w) 
VALUES ( 'asd','qwe','rty',100,1.234,12, 123.123);

Bad Request: Unknown identifier w

Now it looks to me like the column names are fixed given this table schema, and thus the metadata should not need to be stored for each row. I am guessing that either the phrasing in the documentation is outdated (it's the same as in the Cassandra 1.2 docs) or I'm misunderstanding some core concept at work here.

Can anybody clarify? Bottom line: do I have to worry about the length of the names of my columns or not?

We have been playing it safe and used single-character names where possible (so the above columns would actually be i, r, s, dr, ds, de, sd, ...), but the result is barely human-readable and can be confusing to work with.
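As a back-of-the-envelope figure (assuming the per-column accounting quoted above applies to the seven non-key columns): the descriptive names total 44 bytes per row (report_date 11, start_date 10, end_date 8, subset_descr 12, plus x, y, z at 1 each) versus 11 bytes for the abbreviated ones, so roughly 33 bytes per row are at stake, before compression.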

Pavis answered 30/5, 2014 at 15:44 Comment(3)
"Every column in Cassandra incurs 15 bytes of overhead." - this is meant by row, am I right? So in every row, there is 15 bytes overhead per column, right?Sociology
@Sociology yes, that part is clear.Pavis
I would just like to add, that now, since version 3.x of Cassandra, we no longer get penalized for using long column names! However it might still be a problem for Scylla or others who are still on older versions of the SSTable file format.Posner
The easiest way to figure out what is going on in situations like this is to check the sstable2json (cassandra/bin) representation of your data. This will show you what actually ends up being saved on disk.

Here is the example for your situation

 [
 {"key": "4b6579","columns": [
       ["rid1:ssid1:","",1401469033325000],
       ["rid1:ssid1:end_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:report_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:start_date","2004-10-03 00:00:00-0700",1401469033325000], 
       ["rid1:ssid1:subset_descr","descr",1401469033325000],
       ["rid1:ssid1:x","1",1401469033325000], 
       ["rid1:ssid1:y","5.5",1401469033325000],
       ["rid1:ssid1:z","1",1401469033325000],
       ["rid2:ssid2:","",1401469938599000],
       ["rid2:ssid2:end_date", "2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:report_date","2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:start_date","2004-10-03 00:00:00-0700",1401469938599000], 
       ["rid2:ssid2:subset_descr","descr",1401469938599000],
       ["rid2:ssid2:x","1",1401469938599000],
       ["rid2:ssid2:y","5.5",1401469938599000],
       ["rid2:ssid2:z","1",1401469938599000]
 ]}
 ]

The value of the partition key is saved once per partition (per sstable), as you can see above; its column name doesn't matter at all here, since it is implicit given the table. The column names for the clustering columns are also not present, because with C* you aren't allowed to insert without specifying all portions of the key.

What's left, though, does carry the column name. This is needed in case a partial update to a row is made, so that the update can be saved without the rest of the row information. You could imagine an update to a single column field in a row; to indicate which field it is, C* currently uses the column name, but there are tickets open to change this to a smaller representation: https://issues.apache.org/jira/browse/CASSANDRA-4175
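For example, a partial update like the following (hypothetical values, reusing the rows from the reproduction steps below) writes a single cell, and that cell has to carry its own column name so it can later be merged with the rest of the row:

UPDATE mykeyspace.mytable SET x = 2 WHERE id = 'Key' AND report_id = 'rid1' AND subset_id = 'ssid1';

After another nodetool flush, running sstable2json on the newly flushed sstable should show just the single ["rid1:ssid1:x","2",<timestamp>] cell rather than a full row.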

To generate this

cqlsh
CREATE TABLE mykeyspace.mytable(   id text,   report_id text,   subset_id text,   report_date timestamp,   start_date timestamp,   end_date timestamp,   subset_descr text,   x int,   y double,   z int,   PRIMARY KEY (id, report_id, subset_id) );
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid1','ssid1', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid2','ssid2', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
exit;
nodetool flush
bin/sstable2json $DATA_DIR/mytable/mykeyspace-mytable-jb-1-Data.db 
Cinda answered 30/5, 2014 at 17:29 Comment(3)
On-disk compression will obviously reclaim most of the space lost to duplicated column names, so it's not a big deal for disk space, but for reading it's a bit more of an issue. – Cinda
What do you mean, for reading? And how can I find out the actual size on disk of the compressed data? – Pavis
When the data is actually read from disk it is decompressed, so the in-memory representation on the server will contain the full column names. Wasted memory. To find out the size on disk you can check nodetool status, or just count up the size of all the files in the data dir, minus snapshots. – Cinda
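To make that concrete (a sketch using the same $DATA_DIR convention as the steps above; whether cfstats accepts a keyspace argument varies by version):

nodetool cfstats mykeyspace          # per-table "Space used (live)" is the compressed on-disk size
du -sh $DATA_DIR/mytable             # raw size of the table's data files
du -sh $DATA_DIR/mytable/snapshots   # subtract this, if any snapshots exist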
