Cassandra preventing duplicates
I have a simple table distributed by userId:

create table test (
  userId uuid,
  placeId uuid,
  visitTime timestamp,
  primary key(userId, placeId, visitTime)
) with clustering order by (placeId asc, visitTime desc);

Each pair (userId, placeId) can have either one visit or none. visitTime is just data associated with the visit, used for sorting in queries like select * from test where userId = ? order by visitTime desc.

How can I require (userId, placeId) to be unique? I need to make sure that

insert into test (userId, placeId, visitTime) values (?, ?, ?)

won't insert a second visit for (userId, placeId) with a different time. Checking for existence before inserting isn't atomic; is there a better way?

Ox answered 4/3, 2015 at 13:56 Comment(5)
Are you trying to track the most recent visit time for a distinct user and place? Or what part of your table needs to be unique? Why don't you want to add visitTime to your PK?Mallette
Your question is unclear -- edit it to explain what you need and what you're trying to do (btw: your statement is incorrect in C*)Litter
Clustering order only applies for the rows with the same partition key (which would be userid in your case). It would not apply to rows with different userids. See more info here: datastax.com/documentation/cql/3.0/cql/ddl/…Bligh
@StefanPodkowinski: I've updated the question, it should be more precise nowOx
@CarloBertuccini: Thanks for pointing it out, I hope it's fine nowOx

Let me understand -- if the pair (userId, placeId) should be unique (meaning you will never have two rows with the same pair), what is visitTime useful for in the primary key? Why would you perform a query with order by visitTime desc if each (userId, placeId) pair will have only one row?

If what you need is to prevent duplicates, you have two options.

1 - Lightweight transactions -- using IF NOT EXISTS will do what you want. But as I explained here, lightweight transactions are really slow due to the special handling (a Paxos round) they require in Cassandra.
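A minimal sketch of the LWT approach, assuming the users table from the examples below (the IF NOT EXISTS clause makes the insert conditional on no row already existing for that primary key):

```sql
-- Applied only if no row exists yet for this primary key.
INSERT INTO users (uid, placeid, visittime, otherstuffs)
VALUES (1, 2, 1000, 'FIRST VISIT')
IF NOT EXISTS;
-- Cassandra returns a result row with [applied] = true on success,
-- or [applied] = false plus the existing row if one was already there.
```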

2 - USING TIMESTAMP write-time enforcement (be careful with it!***) -- the 'trick' is to force a decreasing write TIMESTAMP.

Let me give an example:

INSERT INTO users (uid, placeid, visittime, otherstuffs) VALUES (1, 2, 1000, 'PLEASE DO NOT OVERWRITE ME') USING TIMESTAMP 100;

This produces the following output:

select * from users;

 uid | placeid | otherstuffs                | visittime
-----+---------+----------------------------+-----------
   1 |       2 | PLEASE DO NOT OVERWRITE ME |      1000

Let's now decrease the timestamp:

INSERT INTO users (uid, placeid, visittime, otherstuffs) VALUES (1, 2, 2000, 'I WANT TO OVERWRITE YOU') USING TIMESTAMP 90;

Now the data in the table have not been updated, since a higher-timestamp operation (100) already exists for the pair (uid, placeid) -- in fact, the output is unchanged:

select * from users;

 uid | placeid | otherstuffs                | visittime
-----+---------+----------------------------+-----------
   1 |       2 | PLEASE DO NOT OVERWRITE ME |      1000

If performance matters, use solution 2; if it doesn't, use solution 1. For solution 2 you can calculate a decreasing timestamp for each write by subtracting the current system time in millis from a fixed number,

eg:

Long decreasingTimestamp = 2_000_000_000_000L - System.currentTimeMillis();
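The one-liner above can be wrapped in a small helper; a sketch (the class and method names are mine, and the fixed epoch 2_000_000_000_000L is arbitrary, as in the answer -- any value safely beyond your application's lifetime works):

```java
public class DecreasingTimestamp {
    // A fixed point in the future (roughly year 2033 in epoch millis).
    // Any later wall-clock time yields a SMALLER write timestamp,
    // so the earliest insert for a primary key always wins.
    private static final long FIXED_EPOCH = 2_000_000_000_000L;

    static long forMillis(long currentTimeMillis) {
        return FIXED_EPOCH - currentTimeMillis;
    }

    public static void main(String[] args) {
        long first = forMillis(1_000L);  // earlier wall-clock time
        long later = forMillis(2_000L);  // later wall-clock time
        // The later write gets the smaller timestamp, so it loses.
        System.out.println(first > later); // prints "true"
    }
}
```

The returned value would then be passed to USING TIMESTAMP in the insert statement.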

*** this solution may lead to unexpected behaviour if, for instance, you want to delete and then reinsert data. Be aware that once you delete data, you will be able to write it again only if the write operation uses a timestamp higher than the deletion's (if not specified, the timestamp used is the machine's current time).

HTH,
Carlo

Litter answered 4/3, 2015 at 21:56 Comment(1)
Saw this second approach blog.codezuki.com/blog/2015/01/20/unique-set-with-cassandra as well. I was just wondering what happens if the timestamp is always kept the same? Will it help in this case (i.e. keeping track of uniques)?Eck

With Cassandra, each primary key (partition key + clustering key) combination is unique. So if you have an entry with primary key (A, B, C) and you insert a new one with the same (A, B, C) values, the old one will be overwritten.

In your case, the visitTime attribute in the primary key makes this behaviour unusable for you. You might want to rethink your schema and leave visitTime out of the key.

Obstetrician answered 4/3, 2015 at 15:55 Comment(1)
Thanks, but you've said everything I already know. I know why it doesn't work, I'm asking how to make it work.Ox

If I understood your requirement correctly, you don't really need visitTime to be part of the primary key. You also don't need to sort by visitTime in your query, since there will only ever be one occurrence of each userId/placeId combination. And you don't need to insert a "record" with no visitTime: you can safely assume that if your query returns 0 results, the user has never visited this place.

So if you make your PRIMARY KEY just (userId, placeId), then you can use lightweight transactions to accomplish your goal.

You can then use a simple insert into test (userId, placeId, visitTime) values (?, ?, ?) IF NOT EXISTS, which will not overwrite a record if one already exists with the provided userId/placeId combination.
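A sketch of the revised schema and the conditional insert (table and column names follow the question; visitTime becomes a regular, non-key column):

```sql
CREATE TABLE test (
  userId uuid,
  placeId uuid,
  visitTime timestamp,
  PRIMARY KEY (userId, placeId)
);

-- Applied only when no row exists yet for this (userId, placeId).
INSERT INTO test (userId, placeId, visitTime)
VALUES (?, ?, ?)
IF NOT EXISTS;
```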

Anachronistic answered 4/3, 2015 at 19:33 Comment(0)
