Cassandra CQL - clustering order with multiple clustering columns

Asked 27/2, 2014 at 18:43 Answered 4/12, 2019 at 11:45

I have a column family with primary key definition like this:

...
PRIMARY KEY ((website_id, item_id), user_id, date)

which will be queried using queries such as:

SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;

However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.

Due to the order of clustering columns, what I get is an order per user_id then per date. If I change the primary key definition to:

PRIMARY KEY ((website_id, item_id), date, user_id)

I can no longer run the same query, as date must be restricted is user_id is.

I thought there might be some way to say:

...
  PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;

But it doesn't seem to exist. Worst, maybe this is completely stupid and I don't get why.

Is there any ways of achieving this? Am I missing something?

Many thanks!

Icon answered 27/2, 2014 at 18:43 Comment(0)

Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like

SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'

Then you need an additional table which is created to handle those queries, it would only order on Date and not on user id

Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)

Stapes answered 27/2, 2014 at 19:29 Comment(3)

Thanks for your answer @RussS. Second format: PRIMARY KEY ((website_id, item_id), date, user_id) prevents me from using filters: AND user_id = 0 AND date > 'some_date' ; since date is preceding user_id and is restricted by a non-EQ equation. Now second solution without user_id as partition column prevents me from having a per user data since unicity is constrainted by the triplet (website_id, item_id, date). Am I wrong somewhere? – Icon 28/2, 2014 at 9:56

Nope you just need to use both tables. Insert into both at write time. Ie you need one table for each of those query formats – Stapes 28/2, 2014 at 17:3

Ok I understand what you meant. Thanks for the answer, and thanks for the clarification! – Icon 28/2, 2014 at 17:34

In addition to your primary query, if all you try to get is "return the most recent inserted date", you may not need an additional table. You can use "static column" to store the last update time per partition. CASSANDRA-6561

Hunnicutt answered 6/12, 2014 at 20:52 Comment(0)

It probably won't help your particular case (since I imagine your list of all users is unmanagably large), but if the condition on the first clustering column is matching one of a relatively small set of values then you can use IN.

SELECT * FROM myCF
 WHERE website_id = 30 AND item_id = 10
   AND user_id IN ? AND date > 'some_date'

Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.

Desmarais answered 4/12, 2019 at 11:45 Comment(0)

Recommended topics

Hot tags