Cassandra Error - Clustering column cannot be restricted (preceding column is restricted by a non-EQ relation)
Asked Answered
I

3

25

We are using Cassandra as the data historian for our fleet management solution. We have a table in Cassandra , which stores the details of journey made by the vehicle. The table structure is as given below

CREATE TABLE  journeydetails(
bucketid text,
vehicleid text,
starttime timestamp,
stoptime timestamp,
travelduration bigint,
PRIMARY KEY (bucketid,vehicleid,starttime,travelduration)
);

Where:

  1. bucketid :- partition key which is a combination of month and year
  2. vehicleid : -unique id of the vehicle
  3. starttime :- start time of the journey
  4. endtime :- endtime of the journey
  5. travelduration:- duration of travel in milliseconds

We would like to run the following query - get all the travels of a vehicle - 1234567 between 2015-12-1 and 2015-12-3 whose travel duration is greater than 30 minutes

When I run this query:

select * from  journeydetails where bucketid in('2015-12') and vehicleid in('1234567') 
  and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59' 
  and travelduration > 1800000;

I get this result:

InvalidRequest: code=2200 [Invalid query] message="Clustering column "travelduration"
cannot be restricted (preceding column "starttime" is restricted by a non-EQ relation)

Does anyone have a recommendation on how to fix this issue?

Insertion answered 22/12, 2015 at 18:47 Comment(0)
R
39
select * from journeydetails where bucketid in('2015-12') and vehicleid in('1234567')
  and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59' 
  and travelduration > 1800000;

That's not going to work. The reason goes back to how Cassandra stores data on-disk. The idea with Cassandra is that it is very efficient at returning a single row with a precise key, or at returning a continuous range of rows from the disk.

Your rows are partitioned by bucketid, and then sorted on disk by vehicleid, starttime, and travelduration. Because you are already executing a range query (non-EQ relation) on starttime, you cannot restrict the key that follows. This is because the travelduration restriction may disqualify some of the rows in your range condition. This would result in an inefficient, non-continuous read. Cassandra is designed to protect you from writing queries (such as this), which may have unpredictable performance.

Here are two alternatives:

1- If you could restrict all of your key columns prior to travelduration (with an equals relation), then you could apply a your greater-than condition:

select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
  and starttime='2015-12-1 00:00:00' and travelduration > 1800000;

Of course, restricting on an exact starttime may not be terribly useful.

2- Another approach would be to omit travelduration altogether, and then your original query would work.

select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
  and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59';

Unfortunately, Cassandra does not offer a large degree of query flexibility. Many people have found success using a solution like Spark (alongside Cassandra) to achieve this level of reporting.

And just a side note, but don't use IN unless you have to. Querying with IN is similar to using a secondary index, in that Cassandra has to talk to several nodes to satisfy your query. Calling it with a single item probably isn't too big of a deal. But IN is one of those old RDBMS habits that you should really break before getting too deep into Cassandra.

Regretful answered 22/12, 2015 at 21:4 Comment(1)
Thanks for the comments Aron... I will explore Spark as an alternativeInsertion
M
-4

turns out you can just use 'allow filtering' at the end of the query

Mossberg answered 5/10, 2021 at 7:37 Comment(1)
Welcome to StackOverflow. How does this relate to the question? Please provide more details.Belong
S
-6

I run into same problem. Instead of playing with the clustered and primary keys, you could simply add an index to the needed columns. With this, you get rid of these pitfalls and you can use whatever column you want in your query.

Susann answered 24/10, 2019 at 21:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.