Using Cassandra and CQL3, how do you insert an entire wide row in a single request?
I want to insert a single row with 50,000 columns into Cassandra 1.2.8. Before inserting, I have all the data for the entire row ready to go (in memory):

+---------+------+------+------+------+-------+
|         | 0    | 1    | 2    | ...  | 49999 |
| row_id  +------+------+------+------+-------+
|         | text | text | text | ...  | text  |
+---------+------+------+------+------+-------+

The column names are integers, allowing slicing for pagination. Each column value is the text stored at that particular index.

CQL3 table definition:

create table results (
    row_id text,
    index int,
    value text,
    primary key (row_id, index)
) 
with compact storage;
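With `index` as the clustering column, pages of one wide row can be read with range slices. A minimal sketch of the kind of paging query this model enables (the query text and page size below are illustrative assumptions, not taken from the original post):

```java
public class SliceQueries {
    // Build a CQL3 slice query that pages through one wide row.
    // row_id is the partition key and index is the clustering column,
    // so Cassandra can serve a contiguous range of columns efficiently.
    static String pageQuery(String rowId, int fromIndex, int pageSize) {
        return String.format(
            "SELECT index, value FROM results " +
            "WHERE row_id = '%s' AND index >= %d AND index < %d;",
            rowId, fromIndex, fromIndex + pageSize);
    }

    public static void main(String[] args) {
        // First page of 1,000 columns; the next page starts at index 1000.
        System.out.println(pageQuery("my_row_id", 0, 1000));
    }
}
```

This is exactly the slice access that CQL3 collections cannot provide, which is why the collection-based workaround mentioned in the comments breaks the data model.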

As I already have the row_id and all 50,000 name/value pairs in memory, I just want to insert a single row into Cassandra in a single request/operation so it is as fast as possible.

The only approach I can seem to find is to execute the following 50,000 times:

INSERT INTO results (row_id, index, value) values (my_row_id, ?, ?);

where the first ? is an index counter (i) and the second ? is the text value to store at location i.

This takes a lot of time. Even when we put the above INSERTs into a batch, it takes a lot of time.
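For context, on C* < 2.0 "putting the INSERTs into a batch" meant sending one large plain-text BATCH statement that the server had to parse in full, which is a big part of the cost. A hedged sketch of how such a statement is assembled (table and column names are from the question; values are inlined without escaping, for illustration only):

```java
import java.util.List;

public class TextBatchBuilder {
    // Assemble a single plain-text CQL3 BATCH targeting one partition.
    // Before C* 2.0 this entire string had to be parsed server-side,
    // which is slow for tens of thousands of columns.
    static String buildBatch(String rowId, List<String> values) {
        StringBuilder sb = new StringBuilder("BEGIN UNLOGGED BATCH\n");
        for (int i = 0; i < values.size(); i++) {
            sb.append(String.format(
                "  INSERT INTO results (row_id, index, value) VALUES ('%s', %d, '%s');%n",
                rowId, i, values.get(i)));
        }
        return sb.append("APPLY BATCH;").toString();
    }
}
```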

Since we have all the data we need (the complete row) in its entirety, I would assume it to be very easy to just say "here, Cassandra, store this data as a single row in one request", for example:

//EXAMPLE-BUT-INVALID CQL3 SYNTAX:
insert into results (row_id, (index,value)) values 
    ((0,text0), (1,text1), (2,text2), ..., (N,textN));

This example isn't possible via current CQL3 syntax, but I hope it illustrates the desired effect: everything would be inserted as a single query.

Is it possible to do this in CQL3 and the DataStax Java Driver? If not, I suppose I'll be forced to use Hector or the Astyanax driver and the Thrift batch_insert operation instead?

Phenacaine answered 29/8, 2013 at 22:50 Comment(2)
Have you tried using lists / sets / maps. For this case it should do the trick but, as Alex says, it would make an interesting addition to CQL3.Weinreb
Yes, we have tried, and it was reasonably fast, but it completely breaks the desired data model: you cannot do slice queries on CQL3 collections.Phenacaine
Edit: only 4 days after I posted this question regarding Cassandra 1.2.9, Cassandra 2.0 final was released. 2.0 supports batch prepared statements, which should be much faster than the non-batched CQL3 that was required to be used for C* < 2.0. We have not yet tested this to be sure.

When this question was posted 4 days ago on 30 August 2013, it was not possible in CQL3 for C* versions less than 2.0. It was only possible via a Thrift client, e.g. Astyanax's MutationBatch.

Per Alex's suggestion, I created CASSANDRA-5959 as a feature request, but it was marked as a duplicate to CASSANDRA-4693, which supposedly solved the issue for C* 2.0.

Phenacaine answered 30/8, 2013 at 18:31 Comment(2)
Thanks Les. While I do agree that this could be considered at this time a limitation of the java driver, I actually think it's more of a CQL limitation. Hopefully Cassandra guys will agree and add it.Kaffraria
As an addition to this - I had a conversation with thobbs on the #cassandra channel. He said that unlogged batches sent to one partition are performed as a single operation so fairly efficient queries can be obtained that way. The suggestion was to batch in sizes of ~1k to avoid putting too much heap pressure on a node.Intenerate
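The ~1k batching advice in the comment above can be sketched as a plain chunking helper; each chunk would then be sent as one unlogged batch against the same partition (the helper and the sizes are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunker {
    // Split one wide row's columns into chunks of ~1,000 so that each
    // unlogged batch stays small enough to avoid heap pressure on the
    // coordinator node. Every chunk targets the same partition, so the
    // server can apply it as a single operation.
    static <T> List<List<T>> chunks(List<T> items, int size) {
        List<List<T>> out = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += size) {
            out.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return out;
    }
}
```

With the DataStax driver, each chunk would be bound into one batch statement and executed; 50,000 columns become 50 round trips instead of 50,000.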
Multiple INSERTs / UPDATEs can be done using the batch_mutate method of the Thrift API, by building nested mutation maps.

// Assumes key, columnFamily, name, value, and an open Thrift client are already set up.
Column column = new Column(ByteBuffer.wrap(name.getBytes()));
column.setValue(ByteBuffer.wrap(value.getBytes()));
column.setTimestamp(System.currentTimeMillis() * 1000); // microseconds

Mutation mutation = new Mutation();
mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn().setColumn(column));

List<Mutation> mutationList = new ArrayList<Mutation>();
mutationList.add(mutation);

Map<String, List<Mutation>> m = new HashMap<String, List<Mutation>>();
m.put(columnFamily, mutationList);

// Row key -> (column family -> mutations); row keys are ByteBuffers in the Thrift API.
Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap =
        new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
mutationMap.put(key, m);

client.batch_mutate(mutationMap, ConsistencyLevel.ALL);
Hoeg answered 30/8, 2013 at 12:51 Comment(1)
The question is how to do it with CQL3, not Thrift.Incidentally
  1. The CQL3 INSERT statement doesn't support multiple value tuples. But I think this could make an interesting addition to CQL, so please submit a feature request.

  2. The DataStax Java driver is based on CQL, so there's nothing it can do if the statement is not supported.

  3. For the time being, if you need this, your best option would be to use a Thrift-based library (nb: I'm not familiar enough with the Thrift-based API to confirm this insert is possible, but I think it should be).

Kaffraria answered 30/8, 2013 at 6:45 Comment(3)
Just an update - this is definitely possible with Thrift. Our test w/ the Datastax Java Driver and a CQL3 batch (using the actual Batch API) on a local dev machine took 1.5 minutes. The same operation with Astyanax (via a MutationBatch aka batch_mutate) took 235 milliseconds. This does not bode well for the Datastax Java Driver in our project. That being said, I'm an appreciative open-source citizen, so I'll open a feature request.Phenacaine
@Les Hazlewood the perf is so bad with the actual Batch API probably because it is a plain-text query (so parsing the text is pricey). If you have the opportunity to test the batch API using prepared statements, I'd be interested in the results. There was a big debate about CQL3 perf vs Thrift.Kith
If I remember to post the results once we've tested, I certainly will!Phenacaine

Use a BATCH statement in CQL3 if you want to do multiple inserts.

With C* 2.0, it'll be even easier and faster, since prepared statements will be enabled in batches.

Kith answered 31/8, 2013 at 19:26 Comment(3)
Per my original post, batch statements in CQL3 on C* < 2.0 are extremely slow for wide rows.Phenacaine
Completely agree with you Les Hazlewood. Fortunately C* 2.0 has just been released so you can go with it :)Kith
I tried prepared batch statements with Cassandra 2.0, and it is still painfully slow. #21779171Trevethick

© 2022 - 2024 — McMap. All rights reserved.