Export a large amount of data from Cassandra to CSV

I'm using Cassandra 2.0.9 to store quite a big amount of data, let's say 100 GB, in one column family. I would like to export this data to CSV in a fast way. I tried:

  • sstable2json - it produces quite big JSON files which are hard to parse, because the tool puts data in one row and uses a complicated schema (e.g. a 300 MB data file becomes ~2 GB of JSON); it takes a lot of time to dump, and Cassandra likes to change source file names according to its internal mechanism
  • COPY - causes timeouts on quite fast EC2 instances for a large number of records
  • CAPTURE - like above, causes timeouts
  • reads with pagination - I used timeuuid for it, but it returns only about 1.5k records per second

I use an Amazon EC2 instance with fast storage, 15 GB of RAM, and 4 cores.

Is there any better option for exporting gigabytes of data from Cassandra to CSV?

Corby asked 22/7, 2014 at 19:38 Comment(3)
Have you considered making your own little contraption for this? Using the DataStax driver you could easily make requests that extract your data, then serialize them into CSV file(s) with little to no Java code. This would ensure you get the exact result you want (for a little effort, though).Klemperer
Moreover, I understand neither the method nor the problem with the "reads with pagination".Klemperer
reads with pagination - using the Python driver I tried to read the content using a limit (tested values 100 - 10000, based on TimeUuid) and an offset; it was really slow. Cassandra was able to read about 1.5k records per second on 3 instances with replication factor 2. I cannot imagine that simply using the driver will make it possible to build a fast read, because for each row Cassandra has to check on which node the data are stored.Corby

Update for 2020: DataStax provides a special tool called DSBulk for loading and unloading data from Cassandra (starting with Cassandra 2.1) and DSE (starting with DSE 4.7/4.8). In the simplest case, the command line looks as follows:

dsbulk unload -k keyspace -t table -url path_to_unload

DSBulk is heavily optimized for loading/unloading operations and has a lot of options, including import/export from/to compressed files, providing custom queries, etc.
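
For example, DSBulk's -query option lets you unload the result of a custom query instead of a whole table (the keyspace, table, and column names below are placeholders):

dsbulk unload -query "SELECT id, value FROM keyspace.table" -url path_to_unload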

There is a series of blog posts about DSBulk that provides more information & examples: 1, 2, 3, 4, 5, 6

Few answered 11/6, 2020 at 7:54 Comment(0)

Because using COPY is quite challenging when you are trying to export a table with millions of rows from Cassandra, what I have done is create a simple tool to get the data chunk by chunk (paginated) from the Cassandra table and export it to CSV.

Look at my example solution using the Java library from DataStax.
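
Not the author's tool itself, but here is a minimal sketch of the same chunk-by-chunk idea, relying on the DataStax Java driver's transparent paging (fetch size), written in Scala. The contact point, keyspace/table, and column names (id as uuid, value as text) are assumptions, and real CSV output would also need proper quoting/escaping:

import scala.collection.JavaConverters._
import com.datastax.driver.core.{Cluster, SimpleStatement}
import java.io.PrintWriter

object PaginatedCsvExport {
  def main(args: Array[String]): Unit = {
    // Hypothetical contact point; replace with your node address(es).
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()
    val out     = new PrintWriter("export.csv")
    try {
      val stmt = new SimpleStatement("SELECT id, value FROM my_keyspace.my_table")
      stmt.setFetchSize(5000) // rows per page; the driver fetches subsequent pages transparently
      session.execute(stmt).iterator().asScala.foreach { row =>
        out.println(s"${row.getUUID("id")},${row.getString("value")}")
      }
    } finally {
      out.close()
      session.close()
      cluster.close()
    }
  }
}

Note that a single such reader still drives one query at a time; running several exporters over disjoint token ranges is the usual way to parallelize, which is essentially what DSBulk and the Spark connector do.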

Sheepherder answered 19/1, 2015 at 17:10 Comment(2)
As I wrote before: I tried this solution, but it was very slow; it delivered maybe 1.5k records per second. Did you run any benchmark with a replicated setup and 100 GB of data?Corby
The biggest table in our dataset was 22 GB of data (~122 million rows), extracted in 444m38.061s (using a 1-host Cassandra server: 4 cores, 8 GB RAM, SATA HDD). That is around 4.5k rows/second, which is good enough for us because we only do this to migrate the data somewhere else.Sheepherder

Inspired by @user1859675's answer, here is how we can export data from Cassandra using Spark:

val cassandraHostNode = "10.xxx.xxx.x5,10.xxx.xxx.x6,10.xxx.xxx.x7";
val spark = org.apache.spark.sql.SparkSession
                                    .builder
                                    .config("spark.cassandra.connection.host",  cassandraHostNode)
                                    .appName("Awesome Spark App")
                                    .master("local[*]")
                                    .getOrCreate()

val dataSet = spark.read.format("org.apache.spark.sql.cassandra")
                        .options(Map("table" -> "xxxxxxx", "keyspace" -> "xxxxxxx"))
                        .load()

val targetfilepath = "/opt/report_values/"
dataSet.write.format("csv").save(targetfilepath)  // Spark 2.x

You will need "spark-cassandra-connector" on your classpath for this to work. The version I am using is below:

    <dependency>
        <groupId>com.datastax.spark</groupId>
        <artifactId>spark-cassandra-connector_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
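
If you build with sbt rather than Maven, the equivalent coordinate (a sketch, assuming scalaVersion is set to 2.11 to match the _2.11 artifact) is:

    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.3.2"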
Evidently answered 21/1, 2020 at 11:4 Comment(0)
