MySQL index cardinality - performance vs storage efficiency

Say you have a MySQL 5.0 MyISAM table with 100 million rows, with one index (other than primary key) on two integer columns.

From my admittedly poor understanding of B-tree structure, I believe that a lower cardinality means the storage efficiency of the index is better, because there are fewer parent nodes. Whereas a higher cardinality means less efficient storage, but faster read performance, because it has to navigate through fewer branches to get to whatever data it is looking for to narrow down the rows for the query.

(Note - by "low" vs "high", I don't mean e.g. 1 million vs 99 million for a 100 million row table. I mean more like 90 million vs 95 million)

Is my understanding correct?

Related question - How does cardinality affect write performance?

Perrin answered 8/4, 2010 at 2:23 Comment(4)
I'm not sure what you mean by "cardinality" here. Do you mean the block size used by the b-tree (probably b+-tree, actually) structure? - Gschu
Cardinality, as in, the number of unique values. Higher cardinality = more unique values. - Perrin
For example, here is a post I found that says higher cardinality will result in better read performance. But there aren't many articles I can find out there about this, and this is just some random blog, so I don't really know. databasedesign-resource.com/mysql-tuning.html - Perrin
Also in that article, the recommendation for indexes on higher-cardinality columns is for a 1 column index. My question is for multi-column indexes, which may have different implications to what's happening behind the scenes. - Perrin

Whereas a higher cardinality means less efficient storage, but faster read performance, because it has to navigate through fewer branches to get to whatever data it is looking for to narrow down the rows for the query.

Higher cardinality means better read performance because, by definition, fewer records share each key value, so there are fewer records to read for any given value.

To process a query like this:

SELECT  *
FROM    mytable
WHERE   indexed_col = @myvalue

, the engine performs the following steps:

  1. Find the first entry satisfying the condition.

    This is done by traversing the B-Tree, starting from the root entry.

    Across the pages, the search is performed by following B-Tree links; within a page, the search is performed using binary search (unless your keys are compressed, in which case it's a linear search).

    This algorithm has the same efficiency for both high cardinality and low cardinality columns. Finding the first 3 (as opposed to any 3) in these lists:

    1  2  3  4  5  6  7  8  9  10
    
    3  3  3  3  3  3  3  3  4  4
    

    requires the same O(log(n)) steps.

  2. Traverse the index until the key value changes. This, of course, requires linear time: the more records share the key value, the more you need to traverse.
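
The two steps above can be sketched in Python. This is only an illustration, not how MyISAM is implemented: a sorted list stands in for the index leaf entries, `bisect_left` stands in for the B-Tree descent, and `find_matches` is a hypothetical helper name. It reuses the two example lists from step 1.

```python
from bisect import bisect_left


def find_matches(index_keys, value):
    """Return (position of first match, number of matching entries).

    index_keys is a sorted list standing in for the index leaf entries.
    """
    # Step 1: locate the first entry >= value in O(log n),
    # regardless of how many duplicates the list contains.
    start = bisect_left(index_keys, value)

    # Step 2: walk forward until the key value changes.
    # The cost here is proportional to the number of matching entries.
    count = 0
    pos = start
    while pos < len(index_keys) and index_keys[pos] == value:
        count += 1
        pos += 1
    return start, count


high_cardinality = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
low_cardinality = [3, 3, 3, 3, 3, 3, 3, 3, 4, 4]

print(find_matches(high_cardinality, 3))  # (2, 1)
print(find_matches(low_cardinality, 3))   # (0, 8)
```

Both calls spend the same logarithmic effort in step 1; only the length of the scan in step 2 differs, and that is where cardinality shows up.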

If you only need the first record:

SELECT  *
FROM    mytable
WHERE   indexed_col = @myvalue
LIMIT 1

, the column cardinality does not affect read performance.

How does cardinality affect write performance?

Each index key has a hidden additional value: a record pointer. This is the whole point of having an index: you need to know which record it points to.

Since a record pointer, by definition, is unique, each index key is unique too. The index entries sharing the same key value are sorted by the record pointer.

This is to make the index maintainable: if you delete a record whose indexed column value is shared by a million other records, the corresponding index record should be deleted too. But the engine does not look through all million index records: instead, the record pointer is used as an additional search condition.

Each index key is in fact unique (even if you don't define the index as unique), and, hence, has maximum cardinality possible.
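
A minimal sketch of that idea in Python, under stated assumptions: the index is modelled as a sorted list of `(key_value, record_pointer)` tuples, the record pointers are made-up integers, and `delete_entry` is a hypothetical helper. It is not the engine's actual code; it only shows why the number of duplicates does not matter when locating the entry to remove.

```python
from bisect import bisect_left


def delete_entry(index, key_value, record_pointer):
    """Remove the entry (key_value, record_pointer) from a sorted list.

    Returns True if the entry was found and removed, False otherwise.
    """
    target = (key_value, record_pointer)
    # Tuples compare element by element, so entries with equal key values
    # are ordered by record pointer and the search stays O(log n).
    pos = bisect_left(index, target)
    if pos < len(index) and index[pos] == target:
        # Note: deleting from a Python list shifts later elements;
        # a B-Tree only touches the page that holds the entry.
        del index[pos]
        return True
    return False


# 100,000 entries share the key value 3; two entries have the key value 4.
index = [(3, ptr) for ptr in range(100_000)] + [(4, 100_000), (4, 100_001)]

print(delete_entry(index, 3, 54_321))  # True
print(delete_entry(index, 3, 54_321))  # False: already removed
print(len(index))                      # 100001
```

The search never scans the 100,000 duplicates, because the record pointer makes every composite entry distinct.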

So the answer to your questions is: no, the column cardinality does not affect the index write performance.

Magenta answered 8/4, 2010 at 10:15 Comment(3)
Thank you for the highly detailed answer. My question was related to multi-column indexes, but your examples are for single-column indexes. Does that change anything? Also, storage efficiency is important to me as well. For multi-column indexes, I was thinking that high cardinality of the first (left) column of the index would mean more storage space, vs having the lower cardinality column on the left. Higher cardinality on the left would mean more parent nodes, correct? Does that affect storage space at all? Thanks again :) - Perrin
@Sean: this is also valid for composite indexes. If you have key compression enabled (in MyISAM), low cardinality columns can even save you some space (but they imply linear search in the pages, so it's a matter of tradeoff). The number of parent nodes totally depends on the number of records that can fit on a page. - Magenta
@Magenta - With MyISAM going away, the "key compression" point is no longer valid. There is no good reason for considering cardinality of the columns of a composite index in InnoDB. - Brooklime
