How to copy data from a Cassandra table to another structure for better performance
In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:

The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.

[...]

If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.

I have understood this, and now my question is: provided that I have an existing table, say

CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
)

But I want to query by year and type instead, so I'd like to have something like

CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
)

With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to the other so that I can perform optimized queries later on?
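For illustration, the kind of query the new table is meant to serve would look like this (the literal values are just examples):

SELECT * FROM invoices_yr WHERE type_invoice = 'standard' AND year = 2016;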

My Cassandra version:

user@cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
Sydel answered 3/1, 2017 at 16:48 Comment(3)
Do you have just vanilla C* (spark?)? What version (if materialized views are available)?Brutus
Regarding the statement "Duplication of data is your friend in Cassandra", I would strongly recommend against this. Duplication of data is not your friend. It can lead to synchronization issues and data corruption. At best it is an acquaintance that should not be fully trusted or relied upon.Egidio
I don't think "Duplication of data is not your friend" as a general statement can be said lightly in the context of Big Data databases either.Sydel

You can use the cqlsh COPY command.
To copy your invoices data into a CSV file, use:

COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';

And to copy back from the CSV file into the table, in your case invoices_yr, use:

COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
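Note that COPY ... FROM expects the destination table to already exist, so create invoices_yr first with the definition from the question:

CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
);

(Be aware that with PRIMARY KEY (type_invoice, year) only one row per type/year pair survives the import; add a clustering column such as id_invoice if every invoice should be kept.)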

If you have a huge amount of data, you can use the SSTable writer to write the data and sstableloader to load it faster: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
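For reference, the load step is then a single command per generated data directory, along these lines (the host and the path are placeholders for your own cluster and keyspace):

sstableloader -d 127.0.0.1 /path/to/generated/sstables/mykeyspace/invoices_yr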

Ambient answered 4/1, 2017 at 4:15 Comment(0)

To echo what was said about the COPY command, it is a great solution for something like this.

However, I will disagree with what was said about the bulk loader, as it is far harder to use. Specifically, you need to run it on every node (whereas COPY only needs to be run on a single node).

To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.

COPY invoices(id_invoice, year, id_client, type_invoice) 
  TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;

Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.

For more info, check out this article titled: New options and better performance in cqlsh copy.

Algar answered 4/1, 2017 at 13:57 Comment(0)

An alternative to using the COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.

CREATE MATERIALIZED VIEW invoices_yr AS
       SELECT * FROM invoices
       WHERE year IS NOT NULL AND type_invoice IS NOT NULL AND id_client IS NOT NULL
       PRIMARY KEY ((type_invoice), year, id_client)
       WITH CLUSTERING ORDER BY (year DESC);

Cassandra will then fill the view for you, so you won't have to migrate the data yourself. With 3.5, be aware that repairs don't work well (see CASSANDRA-12888).

Note: materialized views are probably not the best idea to use, and they have since been changed to "experimental" status.
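As the comments below point out, the definition above fails on this schema: the base table's primary key (id_invoice) must be part of the view's primary key, and only one new column may be added to it. A minimal sketch of a view that does satisfy those rules (the invoices_by_type name is illustrative) can only partition by type:

CREATE MATERIALIZED VIEW invoices_by_type AS
       SELECT * FROM invoices
       WHERE type_invoice IS NOT NULL AND id_invoice IS NOT NULL
       PRIMARY KEY ((type_invoice), id_invoice);

Serving queries on both type and year through a view would require year to be part of the base table's primary key, which is the limitation the comments describe.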

Brutus answered 4/1, 2017 at 14:45 Comment(5)
Hey Chris! Good tip about the repairs in 3.5.Algar
Chris, your example raises the following error: InvalidRequest: code=2200 [Invalid query] message="Cannot include more than one non-primary key column 'year' in materialized view partition key"Sydel
For reference: docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateMV.htmlSydel
"Requirements for a materialized view: The columns of the source table's primary key must be part of the materialized view's primary key. Only one new column can be added to the materialized view's primary key. Static columns are not allowed. "Sydel
Materialized views are promising, but as it stands the solution involves modifying the original table and adding clustering keys so they can be included in the materialized view, and that can prevent UPDATEs for instance.Sydel
