how to avoid secondary indexes in cassandra?

Asked 4/8, 2014 at 18:15 Answered 19/4, 2018 at 20:2

Solved cql3 cassandra-2.0 secondary-indexes

I have heard repeatedly that secondary indexes (in cassandra) is only for convenience but not for better performance. The only case where it is recommended to use secondary indexes when you have low cardinality (such as gender column which has two values male or female)

consider this example:

CREATE TABLE users ( 
userID uuid, 
firstname text, 
lastname text, 
state text, 
zip int, 
PRIMARY KEY (userID) 
);

right now I cannot do this query unless I create a secondary index on users on firstname index

select * from users where firstname='john'

How do I denormalize this table such that I can have this query: Is this the only efficient way by using composite keys? Any other alternatives or suggestions?

CREATE TABLE users ( 
    userID uuid, 
    firstname text, 
    lastname text, 
    state text, 
    zip int, 
    PRIMARY KEY (firstname,userID) 
    );

Luanaluanda answered 4/8, 2014 at 18:15 Comment(0)

In order to come up with a good data model, you need to identify first ALL queries you would like to perform. If you only need to look up users by their firstname (or firstname and userID), then your second design is fine...

If you also need to look up users by their last name, then you could create another table having the same fields but a primary key on (lastname, userID). Obviously you will need to update both tables in the same time. Data duplication is fine in Cassandra.

Still, if you are concerned about the space needed for the two or more tables, you could create a single users table partitioned by user id, and additional tables for the fields you want to query by:

CREATE TABLE users ( 
    userID uuid, 
    firstname text, 
    lastname text, 
    state text, 
    zip int, 
    PRIMARY KEY (userID) 
);

CREATE TABLE users_by_firstname (
    firstname text,
    userid uuid,
    PRIMARY KEY (firstname, userid)
);

The disadvantage of this solution is that you will need two queries to retrieve users by their first name:

SELECT userid FROM users_by_firstname WHERE firstname = 'Joe';
SELECT * FROM users WHERE userid IN (...);

Hope this helps

Airla answered 6/8, 2014 at 16:24 Comment(9)

Thanks for the explanation. But the question is about secondary indexes. – Luanaluanda 6/8, 2014 at 17:24

I thought that the essence of your question is "How do I denormalize this table such that I can have this query". As it was mentioned, secondary indexes don't perform very well, unless you pre-filter your data by the partition key. According to my tests on a relatively small cluster (5 nodes) they can be 10 times slower than queries by partition key. On a bigger cluster the difference could be even bigger. In my experience it is much better to simulate them with a secondary table, just as I did with the users_by_firstname table. – Airla 6/8, 2014 at 18:2

so you say doing two queries is faster than a single lookup on secondary index correct? – Luanaluanda 6/8, 2014 at 18:11

Correct. The reason behind is that rows are partitioned by the partition key, so during lookups Cassandra knows exactly which node holds the data. With secondary indexes each node maintains its own index and the query needs to be executed on all nodes, and then the results need to be combined. So yes, two queries by partition key are faster than one query by secondary index. Duplicating your tables with different primary keys is also a good solution. – Airla 6/8, 2014 at 18:34

I did not know that with secondary indexes, the query is executed on all nodes. so can you give a use case where secondary index will be helpful? – Luanaluanda 6/8, 2014 at 18:52

Taking your example, you could have the user table's primary key (firstname, userid). Then, you could create a secondary index on lastname. In this case the query SELECT * from users where firstname='Joe' and lastname='Doe' could benefit from the index on last name, because the data is already prefiltered by firstname, so it can be executed on a single node. Another use case would be when you are not too concerned about performance, just want a simple solution. – Airla 6/8, 2014 at 19:28

Let us continue this discussion in chat. – Luanaluanda 6/8, 2014 at 19:32

What if I want to update the firstname of the user? How will I do that? Since PK cannot be updated , how will one go about updating the firstname of the user? – Cannice 13/4, 2015 at 18:56

Changing the firstname will create a new row in Cassandra. You'll have to delete or inactivate the existing row. – Airla 14/4, 2015 at 8:23

There are a few way of doing this, all with pros and cons.

Your second query will work, but it's just an index table. http://wiki.apache.org/cassandra/SecondaryIndexes A secondary index can be helpful, and if you hit a partition first (which you can't do in your first table), then cassandra's implementation will save you hassle, and keep things "local atomic". Without hitting a partition though, your first table with the index will not be great with your query as it'll hit everything everywhere.
You can fully denormalise, but you can also do a look up table. i.e. Your second table can exist only to return the user id. You can then do a second query to fetch information for only the relevant partitions. If you're expecting few results, this can be good. If not, you'll be hitting many partitions across many nodes (which depending on your cluster size and hotspot avoidance criteria, can be good or bad). Doing many ~1ms queries are usually better than doing one ~1000ms query.
You can do artificial bucketing, and issue n=bucketcount queries. This has extra overhead, but reduces query count and can be a good option.
Your index might be of the first few characters of the firstname. Or it could be a consistent hash into a few buckets. The former can give you "starts with" semantics.

These are just a few options. Going from a logical data model to a physical one requires evaluation of which tradeoffs you wish to make.

Nerti answered 5/8, 2014 at 4:2 Comment(3)

please provide example table for points 2 and 3. I don't quite follow. – Luanaluanda 5/8, 2014 at 6:13

2: create table lookup ( firstname text primary key, userid uuid) 3: create table foo (bucketid int, somecol int, ... primary key (bucketid, your_cols). When querying you could then do ... WHERE bucketid in (1,2,3). That would issue 3 queries on 3 paritions. – Nerti 5/8, 2014 at 10:42

your point 2 is same as my first table in the post. that will not allow where firstname="john" kind of queries. The bucketid seems interesting. But I am not totally clear how it works. if you could edit the answer with how this bucketid could be used, it will be very helpful. Thanks – Luanaluanda 5/8, 2014 at 18:25

There's also Materialized views with automatic udpates that partition data on different columns, so therefore making reads much faster and avoid secondary indices altogether. There are some additional benefits of doing this on your own.

The general idea of avoiding hot partitions still remains.

And then, there is also SASI index if you are doing lot of updates on the materialized view primary key to avoid tombstones.

Hereford answered 19/4, 2018 at 20:2 Comment(0)

Recommended topics

Hot tags