performant ordering of keys in a MySQL compound index (WRT Rails Polymorphic associations and STI)
Previously, I asked this question about compound indexes on polymorphic foreign keys in ActiveRecord. The basis of my question was my understanding that indexes should be based on the cardinality of your column, and there's generally pretty low cardinality on Rails's STI type and polymorphic _type columns.

Accepting that the answer to my question is right -- that there's value in indexing both the high-cardinality _id columns and the low-cardinality _type columns, because together they have a high cardinality -- my next question is: how should you order your compound indexes?

An index of [owner_id, owner_type] places the field with higher cardinality first, while [owner_type, owner_id] places the field with higher cardinality second. Is a query using the former key more performant than a query using the latter key, or are they equally performant?

I ask because this has particular bearing on how I would order the compound keys for tables serving STI models. STI Rails finders almost always query on the type column -- which again is a column of generally low cardinality. The type column is therefore queried much more often than other indexed columns. If so, then maybe it makes sense to use the type-leading index, because less specific queries could take advantage of the first part of the index, yielding a performance boost. However, I wouldn't want that smaller perk to come at the expense of highly specific queries that take advantage of the higher-cardinality portion of the index.

Hiett answered 9/2, 2011 at 16:1 Comment(1)
In some cases Rails's polymorphic associations only use 2-3 types, so maybe it makes sense to not even use polymorphic associations and instead use separate foreign key columns like: back_account_id, merchant_id, client_id. Then have separate indexes for each of these columns. Having many indexes slows down edits though... Duh - dilemma.

From my own research (but I'm no expert DBA) I've learned that there are two things to consider when deciding the order of a compound key index.

First, concerning the cardinality of columns, indexes are generally better at searching columns with high cardinality. So I would be inclined to place the column with the highest cardinality first in the index. For reference, there's an article titled MySQL Query Optimization that says:

Indexes work best for columns that have a high cardinality relative to the number of rows in the table (that is, columns that have many unique values and few duplicates).

In your case, the _id columns clearly fit that definition better, so they're a better candidate for the prefix of the key.
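To make the cardinality comparison concrete, here is a minimal sketch that measures the ratio of distinct values to rows for each column. It uses Python's built-in sqlite3 module as a stand-in for MySQL (the same COUNT(DISTINCT ...) query works there), and the comments table, its data, and the three owner types are all made up for illustration.

```python
import sqlite3

# Hypothetical table: 1,000 rows spread over 3 owner types.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, owner_id INTEGER, owner_type TEXT)"
)
types = ["BankAccount", "Merchant", "Client"]  # assumed type names
rows = [(i // 3 + 1, types[i % 3]) for i in range(1000)]
conn.executemany(
    "INSERT INTO comments (owner_id, owner_type) VALUES (?, ?)", rows
)

def selectivity(column):
    # Ratio of distinct values to total rows; closer to 1.0 means
    # higher cardinality. Column names are hard-coded below, never
    # taken from user input.
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM comments"
    ).fetchone()
    return distinct / total

id_sel = selectivity("owner_id")
type_sel = selectivity("owner_type")
print(f"owner_id: {id_sel:.3f}, owner_type: {type_sel:.3f}")
```

Running the same query against your real table should tell you how lopsided the two columns actually are.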

Another thing to consider would be the reusability of these indexes. Most (if not all) database systems allow a prefix of a compound key to be reused. For example, a compound key on (owner_id, owner_type) could also be used by queries that filter only on owner_id, but not by queries that filter only on owner_type.
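The prefix rule can be checked by asking the database for its query plan. The sketch below again uses sqlite3 in place of MySQL, so the plan text differs from MySQL's EXPLAIN output, but the leftmost-prefix behaviour it shows is the same idea. The comments table and the idx_owner index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, owner_id INTEGER, owner_type TEXT, body TEXT)"
)
# Compound index with the high-cardinality column first.
conn.execute("CREATE INDEX idx_owner ON comments (owner_id, owner_type)")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)

# Filters on the leading column: the index prefix can be used.
by_id = plan("SELECT * FROM comments WHERE owner_id = 42")
# Filters only on the trailing column: the index cannot be used for lookup.
by_type = plan("SELECT * FROM comments WHERE owner_type = 'Merchant'")

print(by_id)
print(by_type)
```

The first plan should mention idx_owner, while the second should fall back to scanning the table.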

So from what you explained in your question you might be better off with two indexes: a compound key index on (owner_id, owner_type) and another on (owner_type).

Finally, it really all comes down to your dataset and queries. Try out multiple scenarios and benchmark different compound key orderings to see which performs best. Also, don't forget that indexes incur a write penalty on your tables.

Update: There's also another rather popular SO question about compound key indexes here:
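As a rough idea of what such a benchmark could look like, here is a small sketch that builds the same hypothetical table twice, once per index ordering, and times an identical query against each. It runs on in-memory SQLite rather than MySQL, so the timings say nothing about your production setup; the point is only the shape of the experiment. The table, data, and function names are all assumptions.

```python
import sqlite3
import time

TYPES = ["BankAccount", "Merchant", "Client"]  # assumed type names

def build(index_columns, n=20000):
    # Build a fresh in-memory table with one compound index.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        "id INTEGER PRIMARY KEY, owner_id INTEGER, owner_type TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO comments (owner_id, owner_type, body) VALUES (?, ?, 'x')",
        [(i // 3 + 1, TYPES[i % 3]) for i in range(n)],
    )
    conn.execute(f"CREATE INDEX idx_owner ON comments ({index_columns})")
    return conn

def bench(conn, sql, params, repeat=200):
    # Time `repeat` executions and return the last result set as well.
    start = time.perf_counter()
    for _ in range(repeat):
        rows = conn.execute(sql, params).fetchall()
    return time.perf_counter() - start, rows

QUERY = "SELECT id FROM comments WHERE owner_type = ? AND owner_id = ?"
results = {}
for ordering in ("owner_id, owner_type", "owner_type, owner_id"):
    elapsed, rows = bench(build(ordering), QUERY, ("Merchant", 42))
    results[ordering] = rows
    print(f"({ordering}): {elapsed:.4f}s for 200 runs, {len(rows)} row(s)")
```

Both orderings must return the same rows; only the elapsed time is expected to vary, and on a real dataset you would want to repeat this with your actual query mix, including writes.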

When should I use a composite index?

Titter answered 13/2, 2011 at 17:32 Comment(1)
An index on owner_type alone might not make a lot of sense: if you have a table with 100M rows and only 3-5 different values of owner_type, it would only slow down inserts/updates/deletes.

TL;DR Put the type first, then the id.

True, putting the id first would increase the cardinality of the first decision, making it easy to scan the resulting records or apply the second small index. However, if you ever query by type alone (which you will), you'll have to maintain another top-level index on the type, which will give you a performance hit on writes.

The other way around, [type, id], will give a top-level index that can be re-used when searching just by type. The second decision then narrows to a single row when id is unique within a type (or to the few rows belonging to one owner, for a polymorphic foreign key), so you're still assured no broad row scanning after index resolution.
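Here is a small sketch of that trade-off, using sqlite3 as a stand-in for MySQL and a hypothetical comments table with a single type-leading index named idx_type_id. It only inspects query plans, so it shows which queries can use the index, not how fast they are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, owner_id INTEGER, owner_type TEXT, body TEXT)"
)
# Single compound index with the type column first.
conn.execute("CREATE INDEX idx_type_id ON comments (owner_type, owner_id)")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)

type_only = plan("SELECT * FROM comments WHERE owner_type = 'Merchant'")
type_and_id = plan(
    "SELECT * FROM comments WHERE owner_type = 'Merchant' AND owner_id = 42"
)
id_only = plan("SELECT * FROM comments WHERE owner_id = 42")

for label, detail in [
    ("type only", type_only),
    ("type + id", type_and_id),
    ("id only", id_only),
]:
    print(f"{label}: {detail}")
```

The type-only and type-plus-id queries should both report idx_type_id, while the id-only query is the one that loses out with this ordering.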

IMO the write performance hit of maintaining another index is not worth the marginal gain of not taking the type decision tree first.

Donothing answered 17/10, 2016 at 21:28 Comment(0)
