How do secondary indexes work in Cassandra?
Asked Answered
A

2

54

Suppose I have a column family:

CREATE TABLE update_audit (
  scopeid bigint,
  formid bigint,
  time timestamp,
  record_link_id bigint,
  ipaddress text,
  user_zuid bigint,
  value text,
  PRIMARY KEY ((scopeid, formid), time)
  ) WITH CLUSTERING ORDER BY (time DESC)

With two secondary indexes, where record_link_id is a high-cardinality column:

CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);

CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);

According to my knowledge Cassandra will create two hidden column families like so:

CREATE TABLE update_audit_id_idx(
    record_link_id bigint,
    scopeid bigint,
    formid bigint,
    time timestamp
    PRIMARY KEY ((record_link_id), scopeid, formid, time)
);

CREATE TABLE update_audit_user_zuid_idx(
    user_zuid bigint,
    scopeid bigint,
    formid bigint,
    time timestamp
    PRIMARY KEY ((user_zuid), scopeid, formid, time)
);

Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.

Consider the following query:

select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
  1. How will this query execute 'under the hood' in Cassandra?
  2. How will a high-cardinality column index (record_link_id) affect its performance?
  3. Will Cassandra touch all nodes for the above query? Why?
  4. Which criteria will be executed first, base table partition_key or secondary index partition_key? How will Cassandra intersect these two results?
Alberic answered 17/4, 2015 at 7:19 Comment(3)
My 2 cents: Since you specified a full partition key, it makes no sense to query all the nodes. It should clearly only query a node responsible for (35, 78005). Due to the way Cassandra is designed, I'm expecting it to consider reducing the number of involved nodes as a priority. Given that, the only involved node should probably look at how many records it has for (35, 78005) and how many it has in the index for record_link_id=9897, and use the fastest one to serve the query (which isn't necessarily the smallest one, depending on whether the index is also sorted by primary key).Spectral
My theory seems to be supported by docs.datastax.com/en/cql/3.0/cql/ddl/…Spectral
if this the way, then creating index over high-cardinality columns will be the fastest and best data model (if you are including partition key also in criteria).Alberic
E
62
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;

How the above query will work internally in cassandra?

Essentially, all data for partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. It will look for the record_link_id entry for 9897, and attempt to match-up entries that match the rows returned where scopeid=35 and formid=78005. The intersection of the rows for the partition keys and the index keys will be returned.

How high-cardinality column (record_link_id)index will affect the query performance for the above query?

High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected, because Cassandra is designed to perform sequential reads for query results. An index query essentially forces Cassandra to perform random reads. As cardinality of your indexed value increases, so does the time it takes to find the queried value.

Does cassandra will touch all nodes for the above query? WHY?

No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes likewise are stored locally, only contain entries that are valid for the local node.

creating index over high-cardinality columns will be the fastest and best data model

The problem here is that approach does not scale, and will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes(The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:

If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.

...

In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.

Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.

Edit

How does having index on low cardinality index when there are millions of users scale even when we provide the primary key

It will depend on how wide your rows are. The tricky thing about extremely low cardinality indexes, is that the % of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter-out about half of those rows, which won't perform well.

Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that most of those users don't all live in the same country or state).

Edit 20180913

For your answer to 1st question "How the above query will work internally in cassandra?", do you know what's the behavior when query with pagination?

Consider the following diagram, taken from the Java Driver documentation (v3.6):

enter image description here

Basically, paging will cause the query to break itself up and return to the cluster for the next iteration of results. It'd be less likely to timeout, but performance will trend downward, proportional to the size of the total result set and the number of nodes in the cluster.

TL;DR; The more requested results spread over more nodes, the longer it will take.

Expansion answered 17/4, 2015 at 13:46 Comment(10)
Thorough and insightful as usual.Erwinery
Thanks for the insight! How does having index on low cardinality index when there are millions of users scale even when we provide the partition key in the query like select * from users partitionkey=x and gender='M'. From the storage perspective, the hidden column family on gender, will it not overflow? will it cause problem since it needs to scan through hidden column family to filter out the result? #29660064Gerge
@Gerge Edit made.Expansion
@BryceAtNetwork23 thanks, but I'm speaking from the hidden CF perspective. If we provide whole primary key in the query in which case only one row would be there but the hidden gender CF is wide because there would be many people with same gender. If for example number of males are 50,000. Does it join all the user ids returned by the secondary index with one row returned by the primary key and filter the result? I'm particularly asking about the performance of this filtering when hidden CF rows are wide. Thanks in advance.Gerge
Could you answer the question ?#29660064Gerge
@Gerge Yes. Say your node indexes 50 users for a certain value. Your query specifically returns 1, but using a (superfluous) secondary index on the indexed value will return the entry contianing all 50 users.Expansion
How has this changed with SSTable Attached Secondary Indices (SASI) introduced in Cassandra 3.4?Aleksandrovsk
@grisaitis it depends on how you're trying to use it. By definition, the indexes are paired with the SSTables, which means that it will still exhibit poor performance when not paired with a partition key. For details, I'd suggest reading this article by Doan Duyhai, which examines the inner-workings of SASI indexes: doanduyhai.com/blog/?p=2058Expansion
@Expansion For your answer to 1st question "How the above query will work internally in cassandra?", do you know what's the behavior when query with pagination? I suppose with pagesize say 500, no matter how many rows select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897; will return, the response time should be same. But what I observed was response time increase when rows meets that condition increase. Really want to know whyThermobarometer
@Thermobarometer Edit made. Hope this helps.Expansion
F
4

Query with only secondary index is also possible in Cassandra 2.x

select * from update_audit where record_link_id=9897;

But this has a large impact on fetching data, because it reads all partitions on distributed environment. The data fetched by this query is also not consistent and could not relay on it.

Suggestion:
Use of Secondary index is considered to be a DIRT query from NoSQL Data Model view.

To avoid secondary index, we could create a new table and copy data to it. Since this is a query of the application, Tables are derived from queries.

Frankish answered 8/9, 2015 at 11:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.