Cassandra query making - Cannot execute this query as it might involve data filtering and thus may have unpredictable performance

Asked 28/3, 2015 at 6:55 Answered 28/3, 2015 at 13:3

python cassandra cassandra-cli cqlengine

I'd the following Cassandra Model:-

class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(index=True)
    model = columns.Text(index=True)
    price = columns.Decimal(index=True)

I needed the following queries:-

q = Automobile.objects.filter(manufacturer='Tesla')
q = Automobile.objects.filter(year='something')
q = Automobile.objects.filter(model='something')
q = Automobile.objects.filter(price='something')

These all were working fine, until i wanted multiple column filtering, ie when I tried

q = Automobile.objects.filter(manufacturer='Tesla',year='2013')

it throws an error saying Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.

I rewrote the query with allowed_filtering, but this is not an optimal solution.

Then upon reading more, I edited my model as follow:-

class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(primary_key=True)
    model = columns.Text(primary_key=True)
    price = columns.Decimal()

With this I was able to filter multiple coulms as well, without any warning.

When I did DESCRIBE TABLE automobile, it shows this creates composite key PRIMARY KEY ((manufacturer), year, model).

So, my question is what if I declare every attribute as primary key? Is there any problem with this, since I'll be able to filter multiple columns as well.

This is just a small model. What if I had a model such as:-

class UserProfile(Model):
    id = columns.UUID(primary_key=True, default=uuid.uuid4)
    model = columns.Text()
    msisdn = columns.Text(index=True)
    gender = columns.Text(index=True)
    imei1 = columns.Set(columns.Text)
    circle = columns.Text(index=True)
    epoch = columns.DateTime(index=True)
    cellid = columns.Text(index=True)
    lacid = columns.Text(index=True)
    mcc = columns.Text(index=True)
    mnc = columns.Text(index=True)
    installed_apps = columns.Set(columns.Text)
    otp = columns.Text(index=True)
    regtype = columns.Text(index=True)
    ctype = columns.Text(index=True)
    operator = columns.Text(index=True)
    dob = columns.DateTime(index=True)
    jsonver = columns.Text(index=True)

and if I declare every attribute as PK, is there any problem with this?

Chatelaine answered 28/3, 2015 at 6:55 Comment(0)

To understand this, you need to understand how cassandra stores data. The first key in the primary key is called the partition key. It defines the partition the row belongs to. All rows in a partition are stored together, and replicated together. Inside a partition, rows are stored according to the clustering keys. These are the columns in the PK that's not the partition key. So, if your PK is (a, b, c, d), a defines the partition. And in a particular partition (say, a = a1), the rows are stored sorted by b. And for each b, the rows are stored sorted by c...and so on. When querying, you hit one (or a few partitions), and then need to specify every successive clustering key up until the key you're looking for. These have to exact equalities except for the last clustering column specified in your query, which may be a range query.

In the previous example, you could thus do

where a = a1 and b > b1 where a = a1 and b=b1 and c>c1 where a = a1 and b=b1 and c=c1 and d > d1

but can't do this:

where a=a1 and c=c1

To do that, you'd need "allow filtering" (realistically, you should look at changing your model, or denormalizing at that point).

Now, on to your question about making every column part of the PK. You could do that, but remember, all writes in Cassandra are upserts. Rows are identified by their primary key. If you make every column part of the PK, you'll not be able to edit a row. You're not allowed to update the value of any column that's in the primary key.

Carnelian answered 28/3, 2015 at 8:17 Comment(4)

@ashiac- Thanks for the input. It was indeed helpful. Thanks a lot. But I've a doubt, lets say I 've a model USER with fields as imei,phone,datetime,model,dob, and few other fields. Now I want to filter on multiple coulmns such as phone,imei,model and dob. 1) Should I declare all these 4 fields as PK? 2) Lets say if all of these are declared as PK, then in future if user updates his mobile number, how will I update it, since its not allowed to update a PK. – Chatelaine 28/3, 2015 at 10:15

Please share your thoughts on what will be the right model for this? – Chatelaine 28/3, 2015 at 10:15

@user1162512 It is important to note that with compound PRIMARY KEYs in Cassandra, you need to query them in-order (meaning you cannot skip a key field). So simply adding columns to your primary key may not be the answer. – Fara 28/3, 2015 at 13:4

@ashiac- How would I perform such a case? For eg:- Let say, I've this same Automobile model, with N fields, where lets say N=10. If I want to filter by every N field. should I create a different model for every different filter type query. As bruce said, it would be better to use Solr for this, is there any alternative to that? – Chatelaine 31/3, 2015 at 5:46

The proper way to solve this, is to take a query-based modeling approach. Instead of one table with three secondary indexes, you should solve this with four (maybe three) tables and ZERO secondary indexes.

Your original table of Automobile is probably ok. Although I'd be curious to see your primary key definition. But so solve your query of Automobile.objects.filter(year='something') I would create an additional query table like this (note: defined in CQL):

CREATE TABLE automobileByYear (
  manufacturer text,
  year bigint,
  model text,
  price decimal,
  PRIMARY KEY ((year),manufacturer,model));

Assuming that you also create a corresponding class on the Python side for this model (AutomobileByYear), you could then serve a query like:

AutomobileByYear.objects.filter(year='2013')

Additionally, having manufacturer as your first clustering key would also allow this query to function:

AutomobileByYear.objects.filter(manufacturer='Tesla',year='2013')

Likewise, to solve for your query by model, I would create an additional query table (automobileByModel), with the PRIMARY KEY definition of the table re-ordered like this:

PRIMARY KEY ((model),manufacturer,year));

The order of your clustering keys (manufacturer and year) would vary by your query requirements, but the point is that model should be your partition key in this case.

EDIT

...but should it such that I should design my table as per my queries, thereby having a LOT of data redundancy. Let say, I've this same Automobile model, with N fields, where lets say N=10. If I want to filter by every N field. should I create a different model for every different filter type query.

In this day and age disk is WAY cheaper than it used to be. That being said, I understand that it isn't always easy to just throw more disk at a problem. The bigger problem I see is adjusting the DAO layer of your application to keep 10 tables in-sync.

In that case, I would advise integrating with a search tool like Elastic or Solr. In fact, the enterprise version of Cassandra integrates with Solr out-of-the-box. If you really do need to run queries on 10+ columns, a robust search tool would compliment your Cassandra cluster nicely.

Fara answered 28/3, 2015 at 13:3 Comment(5)

@Bruce- Thanks for the information. But I've a serious question. Though I know, writes in Cassandra are cheap, but should it such that I should design my table as per my queries, thereby having a LOT of data redundancy. – Chatelaine 30/3, 2015 at 8:59

For eg:- Let say, I've this same Automobile model, with N fields, where lets say N=10. If I want to filter by every N field. should I create a different model for every different filter type query. – Chatelaine 30/3, 2015 at 9:1

@user1162512 Edit made. – Fara 30/3, 2015 at 12:41

Thanks Bruce for the information. Can you please help me with a link to that page which explains querying 10+ columns in Solr on cassandra. It will be really helpful Thanks – Chatelaine 31/3, 2015 at 5:27

I looked over DSE for the same, but its enterprise version not the community one. Any lead to this will be helpful. – Chatelaine 31/3, 2015 at 5:44

Recommended topics

Hot tags