Performance difference between synchronous SELECT + INSERT vs INSERT ... IF NOT EXISTS in CQL?

Asked 15/9, 2015 at 23:34 Answered 15/9, 2015 at 23:59

I have a CQL table (cql 3, cassandra 2.0.*) that looks something like:

CREATE TABLE IF NOT EXISTS user_things (
   user_id bigint,
   thing_id bigint,
   created_at timeuuid,
  PRIMARY KEY (user_id, thing_id)
);

I want to do an insert like

INSERT INTO user_things (user_id, thing_id, created_at) VALUES (?, ?, now())

but only if the row doesn't exist.

I could do this in two synchronous statements (first a SELECT, followed by an INSERT if the SELECT didn't return a row) or I could use INSERT ... IF NOT EXISTS.

The CQL docs state "But please note that using IF NOT EXISTS will incur a non negligible performance cost (internally, Paxos will be used) so this should be used sparingly."

I'm wondering if anybody has done benchmarking to see what is more performant if we have lots of these operations happening? (say hundreds a second)

Franny answered 15/9, 2015 at 23:34 Comment(0)

It depends a lot on what topology you are using. The IF NOT EXISTS is pretty fast if you restrict it to a local data center (with LOCAL_SERIAL) and use a small replication factor. If you try to use it across multiple data centers or with higher replication factors, then it slows down dramatically. There is an open ticket to improve its performance, so hopefully that will get completed soon since it is currently an overly expensive operation with lots of round trips.

Another thing that will slow IF NOT EXISTS down is when you use it on clustered rows. It seems to work the fastest when your table only has a compound partition key and no clustering columns.

If you go the read before write route, then you've got other problems to deal with. First off you will have a race condition since if two clients do a read around the same time, and then both decide to do a write, you'll get one overwriting the other, which kind of makes the read pointless (see another approach here: collision detection. If somehow you don't mind the race condition, and use a low consistency like ONE for the read and write, then it will likely outperform IF NOT EXISTS.

Pretty much you'd have to benchmark it for your system and schema to see which one was faster in your situation.

Laryssa answered 15/9, 2015 at 23:59 Comment(2)

by "compound partition key" you mean like PRIMARY KEY ((user_id, thing_id)) vs PRIMARY KEY (user_id, thing_id) (with clustering columns) ? – Franny 16/9, 2015 at 1:1

Yes, that's what I mean. – Laryssa 16/9, 2015 at 1:36

While I have not done the benchmarking myself, I would imagine that the two synchronous statements would be faster to operate because simply, it's not doing as much. It's executing two well-designed CQL queries, whereas the other method involves at least 4 communication 'phases' between the nodes.

But if you do use this method, are you able to guarantee that these queries are executed atomically and that there won't be an INSERT with the same user_id and thing_id in the time between running the SELECT and running the INSERT? The need to avoid this situation is what drives using lightweight transactions in Cassandra and Paxos in general.

Acidify answered 15/9, 2015 at 23:50 Comment(0)

Recommended topics

Hot tags