Cassandra eats up all the disk space

S

3

5

I have a single node cassandra cluster, I use the current minute as partition key and insert rows with TTL of 12 hours.

I see a couple of issue I can't explain

The /var/lib/cassandra/data/<key_space>/<table_name> contains multiple files, lots of them are really old (way older then 12 hours, something like 2 days)
When I try to perform a query in cqlsh I get a lot of logs that seem to indicate that my table contain lots of tombstones

log:

WARN  [SharedPool-Worker-2] 2015-01-26 10:51:39,376 SliceQueryFilter.java:236 - Read 0 live and 1571042 tombstoned cells in <table_name>_name (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:40,472 SliceQueryFilter.java:236 - Read 0 live and 1557919 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:41,630 SliceQueryFilter.java:236 - Read 0 live and 1589764 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:42,877 SliceQueryFilter.java:236 - Read 0 live and 1582163 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:44,081 SliceQueryFilter.java:236 - Read 0 live and 1550989 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:44,869 SliceQueryFilter.java:236 - Read 0 live and 1566246 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:45,582 SliceQueryFilter.java:236 - Read 0 live and 1577906 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:46,443 SliceQueryFilter.java:236 - Read 0 live and 1571493 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:47,701 SliceQueryFilter.java:236 - Read 0 live and 1559448 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN  [SharedPool-Worker-2] 2015-01-26 10:51:49,255 SliceQueryFilter.java:236 - Read 0 live and 1574936 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}

I've tried multiple compaction strategies, multithreaded compaction, I've tried running compaction manually with nodetool, also, I've tried forcing garbage collection with jmx.

One of my guesses is that the compaction doesn't delete tombstones files

Any ideas how to avoid disk space from getting too big, my biggest concern is running out of space, I'd rather store less (by making the ttl smaller but currently that doesn't help)

Speed answered 26/1, 2015 at 8:58 Comment(0)

C

3

I'm assuming you are using the timestamp as a clustering column within each partition when you say you are using the minute as the partition key, along with a TTL of 12 hours when you do the insert. This will build up tombstones in each partition since you are never deleting the entire row (i.e. a whole minute partition).

Suppose your keyspace is called k1 and your table is called t2, then you can run:

nodetool flush k1 t2
nodetool compact k1 t2
sstable2json /var/lib/cassandra/data/k1/t2/k1-t2-jb-<last version>-Data.db

then you'll see all the tombstones like this (marked with a "d" for deleted):

{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], ["5:","54c7b518",1422374168501000,"d"], ["6:","54c7b51b",1422374171987000,"d"]]}

Now if you go and delete that row (i.e. delete from k1.t2 where key=3;), then do the flush, compact, and sstable2json again, you'll see it change to:

{"key": "00000003","metadata": {"deletionInfo": {"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},"columns": []}

So you see all the tombstones are gone and Cassandra only has to remember that the whole row was deleted at a certain time instead of little bits and pieces of the row being deleted at certain times.

Another way to get rid of the tombstones is to truncate the whole table. When you do that, Cassandra only needs to remember that the whole table was truncated at a certain time, and so no longer needs to keep tombstones prior to that time (since tombstones are used to tell other nodes that certain data was deleted, and if you can say the whole table was emptied at time x, then the details prior to that no longer matter).

So how could you apply this in your situation you ask. Well, you could use the hour and minute as your partition key, and then once an hour run a cron job to delete all the rows from 13 hours ago. Then on the next compaction, all the tombstones for that partition would be removed.

Or keep a separate table for each hour, and then truncate the table from 13 hours ago each hour using a cron job.

Another strategy that is sometimes useful is to "re-use" clustering keys. For example, if you were inserting data once per second, instead of having a high resolution timestamp as a clustering key, you could use the time modulo 60 seconds as the clustering key and keep the more unique timestamp as just a data field. So within each minute partition you would be changing tombstones (or outdated information) from yesterday back into live rows today, and then you wouldn't accumulate tombstones over many days.

So hopefully that gives you some ideas for things to try. Usually when you run into a tombstone problem, it's a sign that you need to re-think your schema a little bit.

Cangue answered 27/1, 2015 at 16:58 Comment(0)

C

4

Tombstones will be preserved for 10 days using the default configuration. The reason for this is to make sure that offline nodes will be able to catch up with deletes when they join the cluster again. You can configure this value by setting the gc_grace_seconds setting.

Caruso answered 26/1, 2015 at 11:35 Comment(4)

Stefan thanks ! But I already tried it, currently gc_grace_seconds is 0. Sorry I forgot to mention that – Speed 26/1, 2015 at 11:43

Are you sure you specified the TTL value correctly (using seconds, not milliseconds)? – Caruso 26/1, 2015 at 12:6

Yes I even see that the rows are deleted, I do see a lot of tombstones, it looks like all the files are tombstones – Speed 26/1, 2015 at 12:20

If you see tombstones after doing a compaction then gc grace is not 0. – Frieze 26/1, 2015 at 17:14

C

3

I'm assuming you are using the timestamp as a clustering column within each partition when you say you are using the minute as the partition key, along with a TTL of 12 hours when you do the insert. This will build up tombstones in each partition since you are never deleting the entire row (i.e. a whole minute partition).

Suppose your keyspace is called k1 and your table is called t2, then you can run:

nodetool flush k1 t2
nodetool compact k1 t2
sstable2json /var/lib/cassandra/data/k1/t2/k1-t2-jb-<last version>-Data.db

then you'll see all the tombstones like this (marked with a "d" for deleted):

{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], ["5:","54c7b518",1422374168501000,"d"], ["6:","54c7b51b",1422374171987000,"d"]]}

Now if you go and delete that row (i.e. delete from k1.t2 where key=3;), then do the flush, compact, and sstable2json again, you'll see it change to:

{"key": "00000003","metadata": {"deletionInfo": {"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},"columns": []}

So you see all the tombstones are gone and Cassandra only has to remember that the whole row was deleted at a certain time instead of little bits and pieces of the row being deleted at certain times.

Another way to get rid of the tombstones is to truncate the whole table. When you do that, Cassandra only needs to remember that the whole table was truncated at a certain time, and so no longer needs to keep tombstones prior to that time (since tombstones are used to tell other nodes that certain data was deleted, and if you can say the whole table was emptied at time x, then the details prior to that no longer matter).

So how could you apply this in your situation you ask. Well, you could use the hour and minute as your partition key, and then once an hour run a cron job to delete all the rows from 13 hours ago. Then on the next compaction, all the tombstones for that partition would be removed.

Or keep a separate table for each hour, and then truncate the table from 13 hours ago each hour using a cron job.

Another strategy that is sometimes useful is to "re-use" clustering keys. For example, if you were inserting data once per second, instead of having a high resolution timestamp as a clustering key, you could use the time modulo 60 seconds as the clustering key and keep the more unique timestamp as just a data field. So within each minute partition you would be changing tombstones (or outdated information) from yesterday back into live rows today, and then you wouldn't accumulate tombstones over many days.

So hopefully that gives you some ideas for things to try. Usually when you run into a tombstone problem, it's a sign that you need to re-think your schema a little bit.

Cangue answered 27/1, 2015 at 16:58 Comment(0)

B

2

I have a similar issue, only in my case there was just a single table that refused to shrink (old files are not deleted and their storage space keeps growing). I used nodetool compactionstats and saw there are a lot of pending compaction tasks. Another interesting thing was i saw in the nodetool compactionstats always showed compactions of compaction type Compaction for the problematic table, but not of type Tombstone Compaction, as oppose to the tables that behaved good. Could it be the problem?

Behn answered 26/1, 2015 at 12:43 Comment(0)

Recommended topics

Hot tags