Using Cassandra and CQL3, how do you insert an entire wide row in a single request?
I want to insert a single row with 50,000 columns into Cassandra 1.2.8. Before inserting, I have all the data for the entire row ready to go (in memory):

+---------+------+------+------+------+-------+
|         | 0    | 1    | 2    | ...  | 49999 |
| row_id  +------+------+------+------+-------+
|         | text | text | text | ...  | text  |
+---------+------+------+------+------+-------+

The column names are integers, allowing slicing for pagination. Each column value is the text stored at that particular index.

CQL3 table definition:

create table results (
    row_id text,
    index int,
    value text,
    primary key (row_id, index)
) 
with compact storage;
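With `index` as the clustering column, pages of one wide row can be read with range slices. A minimal sketch of the kind of paging query this model enables (the query text and page size below are illustrative assumptions, not taken from the original post):

```java
public class SliceQueries {
    // Build a CQL3 slice query that pages through one wide row.
    // row_id is the partition key and index is the clustering column,
    // so Cassandra can serve a contiguous range of columns efficiently.
    static String pageQuery(String rowId, int fromIndex, int pageSize) {
        return String.format(
            "SELECT index, value FROM results " +
            "WHERE row_id = '%s' AND index >= %d AND index < %d;",
            rowId, fromIndex, fromIndex + pageSize);
    }

    public static void main(String[] args) {
        // First page of 1,000 columns; the next page starts at index 1000.
        System.out.println(pageQuery("my_row_id", 0, 1000));
    }
}
```

This is exactly the slice access that CQL3 collections cannot provide, which is why the collection-based workaround mentioned in the comments breaks the data model.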

As I already have the row_id and all 50,000 name/value pairs in memory, I just want to insert a single row into Cassandra in a single request/operation so it is as fast as possible.

The only approach I can seem to find is to execute the following 50,000 times:

INSERT INTO results (row_id, index, value) values (my_row_id, ?, ?);

where the first ? is an index counter (i) and the second ? is the text value to store at location i.

This takes a lot of time. Even when we put the above INSERTs into a batch, it takes a lot of time.
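For context, on C* < 2.0 "putting the INSERTs into a batch" meant sending one large plain-text BATCH statement that the server had to parse in full, which is a big part of the cost. A hedged sketch of how such a statement is assembled (table and column names are from the question; values are inlined without escaping, for illustration only):

```java
import java.util.List;

public class TextBatchBuilder {
    // Assemble a single plain-text CQL3 BATCH targeting one partition.
    // Before C* 2.0 this entire string had to be parsed server-side,
    // which is slow for tens of thousands of columns.
    static String buildBatch(String rowId, List<String> values) {
        StringBuilder sb = new StringBuilder("BEGIN UNLOGGED BATCH\n");
        for (int i = 0; i < values.size(); i++) {
            sb.append(String.format(
                "  INSERT INTO results (row_id, index, value) VALUES ('%s', %d, '%s');%n",
                rowId, i, values.get(i)));
        }
        return sb.append("APPLY BATCH;").toString();
    }
}
```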

Since we have all the data we need (the complete row) in its entirety, I would assume it to be very easy to just say "here, Cassandra, store this data as a single row in one request", for example:

//EXAMPLE-BUT-INVALID CQL3 SYNTAX:
insert into results (row_id, (index,value)) values 
    ((0,text0), (1,text1), (2,text2), ..., (N,textN));

This example isn't possible via current CQL3 syntax, but I hope it illustrates the desired effect: everything would be inserted as a single query.

Is it possible to do this in CQL3 and the DataStax Java Driver? If not, I suppose I'll be forced to use Hector or the Astyanax driver and the Thrift batch_insert operation instead?

Phenacaine answered 29/8, 2013 at 22:50 Comment(2)
Have you tried using lists / sets / maps. For this case it should do the trick but, as Alex says, it would make an interesting addition to CQL3.Weinreb
Yes, we have tried, and it was reasonably fast, but it completely breaks the desired data model: you cannot do slice queries on CQL3 collections.Phenacaine
Edit: only 4 days after I posted this question regarding Cassandra 1.2.9, Cassandra 2.0 final was released. 2.0 supports batch prepared statements, which should be much faster than the non-batched CQL3 that was required to be used for C* < 2.0. We have not yet tested this to be sure.

When this question was posted 4 days ago on 30 August 2013, it was not possible in CQL3 for C* versions less than 2.0. It was only possible via a Thrift client, e.g. Astyanax's MutationBatch.

Per Alex's suggestion, I created CASSANDRA-5959 as a feature request, but it was marked as a duplicate to CASSANDRA-4693, which supposedly solved the issue for C* 2.0.

Phenacaine answered 30/8, 2013 at 18:31 Comment(2)
Thanks Les. While I do agree that this could be considered at this time a limitation of the java driver, I actually think it's more of a CQL limitation. Hopefully Cassandra guys will agree and add it.Kaffraria
As an addition to this - I had a conversation with thobbs on the #cassandra channel. He said that unlogged batches sent to one partition are performed as a single operation so fairly efficient queries can be obtained that way. The suggestion was to batch in sizes of ~1k to avoid putting too much heap pressure on a node.Intenerate
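The ~1k batching advice in the comment above can be sketched as a plain chunking helper; each chunk would then be sent as one unlogged batch against the same partition (the helper and the sizes are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunker {
    // Split one wide row's columns into chunks of ~1,000 so that each
    // unlogged batch stays small enough to avoid heap pressure on the
    // coordinator node. Every chunk targets the same partition, so the
    // server can apply it as a single operation.
    static <T> List<List<T>> chunks(List<T> items, int size) {
        List<List<T>> out = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += size) {
            out.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return out;
    }
}
```

With the DataStax driver, each chunk would be bound into one batch statement and executed; 50,000 columns become 50 round trips instead of 50,000.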
Multiple INSERTs / UPDATEs can be done using the batch_mutate method of the Thrift API, by building nested mutation maps.

// Assumes key, columnFamily, name, value, and an open Thrift client are already set up.
Column column = new Column(ByteBuffer.wrap(name.getBytes()));
column.setValue(ByteBuffer.wrap(value.getBytes()));
column.setTimestamp(System.currentTimeMillis() * 1000); // microseconds

Mutation mutation = new Mutation();
mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn().setColumn(column));

List<Mutation> mutationList = new ArrayList<Mutation>();
mutationList.add(mutation);

Map<String, List<Mutation>> m = new HashMap<String, List<Mutation>>();
m.put(columnFamily, mutationList);

// Row key -> (column family -> mutations); row keys are ByteBuffers in the Thrift API.
Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap =
        new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
mutationMap.put(key, m);

client.batch_mutate(mutationMap, ConsistencyLevel.ALL);
Hoeg answered 30/8, 2013 at 12:51 Comment(1)
The question is how to do it with CQL3, not Thrift.Incidentally
  1. The CQL3 INSERT statement doesn't support multiple value tuples. But I think this could make an interesting addition to CQL, so please submit a feature request.

  2. The DataStax Java driver is based on CQL, so there's nothing it can do if the statement is not supported.

  3. For the time being, if you need this, your best option would be to use a Thrift-based library (nb: I'm not familiar enough with the Thrift-based API to confirm this insert is possible, but I think it should be).

Kaffraria answered 30/8, 2013 at 6:45 Comment(3)
Just an update - this is definitely possible with Thrift. Our test w/ the Datastax Java Driver and a CQL3 batch (using the actual Batch API) on a local dev machine took 1.5 minutes. The same operation with Astyanax (via a MutationBatch aka batch_mutate) took 235 milliseconds. This does not bode well for the Datastax Java Driver in our project. That being said, I'm an appreciative open-source citizen, so I'll open a feature request.Phenacaine
@Les Hazlewood the perf is so bad with the actual Batch API probably because it is a plain-text query (so parsing the text is pricey). If you have the opportunity to test the batch API using prepared statements, I'd be interested in the results. There was a big debate about CQL3 perf vs Thrift.Kith
If I remember to post the results once we've tested, I certainly will!Phenacaine

Use a BATCH statement in CQL3 if you want to do multiple inserts.

With C* 2.0, it'll be even easier and faster, since prepared statements will be enabled in batches.

Kith answered 31/8, 2013 at 19:26 Comment(3)
Per my original post, batch statements in CQL3 on C* < 2.0 are extremely slow for wide rows.Phenacaine
Completely agree with you Les Hazlewood. Fortunately C* 2.0 has just been released so you can go with it :)Kith
I tried prepared batch statements with Cassandra 2.0, and it is still painfully slow. #21779171Trevethick

© 2022 - 2024 — McMap. All rights reserved.