How to archive and purge Cassandra data

I have a Cassandra cluster with multiple data centres. I want to archive data monthly and then purge it. There are numerous articles about backing up and restoring, but none that explain how to archive data in a Cassandra cluster.

Can someone please let me know how I can archive my data in a Cassandra cluster monthly and then purge it?

Hoyt answered 7/9, 2015 at 10:13 Comment(3)
Can you use Java or C#? You could create a console app that extracts the data from Cassandra and archives it. – Blackfoot
Is there any method other than using Spark jobs, which are built into Cassandra? – Hoyt
Let's talk about the archival: it's something we need to do periodically to save disk space, so it's not a real-time job; let a batch process do it. The second point is that we need to do it to release space so CQL queries run faster. It's nothing but get the data -> compress it -> put it in another location, so I would suggest a batch job that takes data out of Cassandra and compresses it outside the cluster... or within the cluster with Snappy. – Caducity
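A minimal sketch of that extract -> compress -> store batch job, assuming the Python cassandra-driver, a hypothetical `my_keyspace.events` table, and gzip standing in for Snappy (gzip ships with the standard library):

```python
import gzip
import json

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical keyspace, table, and archive path -- adjust to your schema.
KEYSPACE = "my_keyspace"
TABLE = "events"
ARCHIVE_FILE = "/archive/events_2015-08.jsonl.gz"

cluster = Cluster(["127.0.0.1"])
session = cluster.connect(KEYSPACE)

# Page through the table so a month's worth of rows never sits in memory at once.
statement = SimpleStatement(f"SELECT * FROM {TABLE}", fetch_size=1000)

with gzip.open(ARCHIVE_FILE, "wt") as archive:
    for row in session.execute(statement):
        # The driver returns named tuples; _asdict() makes them JSON-friendly.
        archive.write(json.dumps(row._asdict(), default=str) + "\n")

cluster.shutdown()
```

In practice you would restrict the query to the month being archived (or walk token ranges) rather than scanning the whole table, ship the compressed file off-cluster, and then let a TTL or a delete job reclaim the space.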

I think there is no off-the-shelf tool for archiving Cassandra data. You have to write either Spark jobs or MapReduce jobs that use CqlInputFormat to archive the data. The links below should help you understand how people archive data in Cassandra:

[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data

[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660

[3] - http://accelconf.web.cern.ch/AccelConf/ICALEPCS2013/papers/tuppc004.pdf

There is also a way to turn on incremental backups in Cassandra, which can be used like CDC.
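For the Spark route mentioned above, a sketch along these lines is typical, assuming the spark-cassandra-connector is on the Spark classpath and using placeholder keyspace, table, column, and path names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder contact point; keyspace/table/paths below are illustrative only.
spark = (
    SparkSession.builder
    .appName("cassandra-monthly-archive")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Read the table through the spark-cassandra-connector data source.
events = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="events")
    .load()
)

# Keep only the month being archived and write it out as Parquet
# (Snappy-compressed by default) to the archival location.
august = events.where(F.col("event_time").between("2015-08-01", "2015-09-01"))
august.write.mode("overwrite").parquet("/archive/events/2015-08")

spark.stop()
```

A scheduler (cron, for example) can run such a job once a month, after which the archived rows can be deleted or left to expire via TTL.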

Wame answered 22/9, 2015 at 9:20 Comment(1)
I am not able to access any of the links. – Hoyt

The best practice is to use the time-window compaction strategy (TWCS) with a monthly window on your tables, together with a one-month TTL, so that data older than a month is purged automatically.
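TWCS windows are specified in minutes, hours, or days, so a "monthly" window is usually approximated with a 30-day window and a matching default TTL. A hedged example against a hypothetical `my_keyspace.events` table, run here through the Python driver:

```python
from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace/table names.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# 30-day compaction windows plus a 30-day default TTL (2,592,000 seconds):
# fully expired SSTables are dropped whole by TWCS, so no purge job is needed.
session.execute("""
    ALTER TABLE events
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 30
    }
    AND default_time_to_live = 2592000
""")

cluster.shutdown()
```

The comment thread below arrives at the same idea with a one-day window (about 30 buckets) and the same TTL; only compaction_window_unit/compaction_window_size change.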

If you instead write a purge job that does this deletion work (on tables that do not have the correct compaction strategy applied), it can hurt cluster performance, because searching the data on a date/month basis will overwhelm the cluster.

I have experienced this: we ultimately had to go back, change the table structure, and alter the compaction strategy. That is why getting the table design right in the first place is so important. From the beginning we need to think not only about how data will be inserted and read, but also about how it will be deleted, and then choose the keys, compaction strategy, TTL, and so on accordingly.

For archiving, just write a few lines of code to read data from Cassandra and put it in your archival location.

Let me know if this helps you get the result you want, or if you have further questions I can help with.

Cavy answered 17/2, 2021 at 18:55 Comment(6)
It won't work if the data changes... TWCS has quite limited use. – Pyknic
Cassandra is immutable by design, so you cannot change anything at all. Even if you need to change or update a record, it will be an INSERT, and the older record will be marked as a tombstone. TWCS is useful to support efficient deletion. – Cavy
I know that data on disk in Cassandra is immutable... My point was that TWCS only works well if you don't make many changes, and there is a limited number of use cases where that is achievable... – Pyknic
The question above asks how to purge data on a monthly basis, so for this case specifically TWCS will work fine. As for your concern: if you want to go back in time and change history, or make any kind of design change, the chances are slim. In that case I would suggest rethinking the choice of tool itself; Cassandra might not be the right choice for every use case. I hope this makes sense. – Cavy
I've seen many customers with data-deletion requirements where the data changes along the way: think about transaction processing. You ingest the initial transaction with status "Started", then it goes to "Checking", then to "Paid", and then to "Done", all maybe within one second. Those are changes; the data is immutable after that, and it still needs to be purged... – Pyknic
True, mate! But transactions that get updated within a few seconds are not going to be deleted in that second, not even on that day. Now, if you want to delete data as soon as it becomes one month old, you can set the TWCS window to one day, which gives you 30 buckets. The compaction cycles keep one day of data per bucket and, depending on when the data arrived, it gets purged. Thank you for your patience. – Cavy
