Cassandra Allow filtering
Asked Answered
P

2

15

I have a table as below

CREATE TABLE test (
 day int,
 id varchar,  
 start int,
 action varchar,  
 PRIMARY KEY((day),start,id)
);

I want to run this query

Select * from test where day=1 and start > 1475485412 and start < 1485785654 
and action='accept' ALLOW FILTERING

Is this ALLOW FILTERING efficient?

I am expecting that cassandra will filter in this order

1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result. 

So the allow filtering will not be a bad choice on this query.

In case of the multiple filtering parameters on the where clause and the non indexed column is the last one, how will the filter work? Please explain.

Palmapalmaceous answered 3/3, 2017 at 10:18 Comment(1)
This video from datastax academcy is really helpful for this topicSaltus
V
7

Is this ALLOW FILTERING efficient?

When you write "this" you mean in the context of your query and your model, however the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data this is a hard to answer question.

I am expecting that cassandra will filter in this order...

Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is you're not following some guidelines on Cassandra modeling (specifically the "one query <--> one table").

As a solution, I could hint you to include the action field in the clustering key just before the start field, modifying your table definition:

CREATE TABLE test (
 day int,
 id varchar,  
 start int,
 action varchar,  
 PRIMARY KEY((day),action,start,id)
);

You then would rewrite your query without any ALLOW FILTERING clause:

SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654

having only the minor issue that if one record "switches" action values you cannot perform an update on the single action field (because it's now part of the clustering key), so you need to perform a delete with the old action value and an insert it with the correct new value. But if you have Cassandra 3.0+ all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.

Vasileior answered 4/3, 2017 at 14:10 Comment(5)
If the query is going to scan the disk in the order which I have said, then allow filtering will not be bad choice right. Because it is filtering only a limited number of data.Palmapalmaceous
@Bharathi: Do you want to scan tombstones too? We know nothing about your data, so the general advice here will be "Don't use ALLOW FILTERING", but if you think you have a small bunch of records, go ahead and use it. After all, using the brain is absolutely the best way to model in Cassandra...Vasileior
I am just trying to understand ALLOW FILTERING. My use case is not the exact same one which I have stated here and I cannot put that here openly. I want to know that if cassandra filters on the order which I have stated, there will not any worry about the amount of data it is going to scan.Palmapalmaceous
@Vasileior is there a way to avoid scanning tombstones?Trackman
@Trackman the only way is to avoid deletes, so you scan live data only. If you want to "skip" not useful data, then change your model and make your queries more efficient. If you can't (both the previous), wait for a compaction...Vasileior
V
3

In general ALLOW FILTERING is not efficient.

But in the end it depends on the size of the data you are fetching (for which cassandra have to use ALLOW FILTERING) and the size of data its being fetched from.

In your case cassandra do not need filtering upto :

  1. By the range column(start) on the 1's result

As you mentioned. But after that, it will rely on filtering to search data, which you are allowing in query itself.

Now, keep following in mind

If your table contains for example a 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.

On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.

So ensure this first. If it works in you favour, use FILTERING. Otherwise, it would be wise to add secondary index on 'action'.

PS : There is some minor edit.

Villosity answered 3/3, 2017 at 10:53 Comment(3)
I read this same answer on different question. This is not the answer for this question. Please be specific on how the query will work if you know.Palmapalmaceous
I don't know about that question, but I read this on this old datastax doc. Look through it. Although I still think my answer explains it pretty well. But if you find otherwise, please mention it here. Will clear my misconceptions as well (If any). Thanks!Villosity
Thanks for the update. I am aware that the ALLOW FILTERING will scan the entire column family. But the question here is, in case of the multiple filtering parameters on the where clause and the non indexed column is the last one, how will the filter works?? Please read the question and let me know if it requires any edit for better understanding.Palmapalmaceous

© 2022 - 2024 — McMap. All rights reserved.