Does it make sense to use an index that will have a low cardinality?
Asked Answered
B

5

66

From my understanding you don't gain much by setting an index in a column that will hold few distinct values.

I have a column that holds a boolean value (actually it's a small int, but I'm using it as a flag), and this column is used in the WHERE clauses of most of the queries I have. In a theoretical "average" case, half of the records' values will be 1 and the other half, 0.

So, in this scenario, the database engine could avoid a full table scan, but will have to read a lot of rows anyway (total rows/2).

So, should I make this column an index?

I'm using Mysql 5, but I'm more interested in a general rationale on why it does / does not make sense indexing a column that I know that will have a low cardinality.

Bateau answered 21/1, 2010 at 21:46 Comment(0)
H
110

An index can help even on low cardinality fields if:

  1. When one of possible values is very infrequent compared to the other values and you search for it.

    For instance, there are very few color blind women, so this query:

    SELECT  *
    FROM    color_blind_people
    WHERE   gender = 'F'
    

    would most probably benefit from an index on gender.

  2. When the values tend to be grouped in the table order:

    SELECT  *
    FROM    records_from_2008
    WHERE   year = 2010
    LIMIT 1
    

    Though there are only 3 distinct years here, records with earlier years are most probably added first so very many records would have to be scanned prior to returning the first 2010 record if not for the index.

  3. When you need ORDER BY / LIMIT:

    SELECT  *
    FROM    people
    ORDER BY
            gender, id
    LIMIT 1
    

    Without the index, a filesort would be required. Though it's somewhat optimized do to the LIMIT, it would still need a full table scan.

  4. When the index covers all fields used in the query:

    CREATE INDEX (low_cardinality_record, value)
    
    SELECT  SUM(value)
    FROM    mytable
    WHERE   low_cardinality_record = 3
    
  5. When you need DISTINCT:

    SELECT  DISTINCT color
    FROM    tshirts
    

    MySQL will use INDEX FOR GROUP-BY, and if you have few colors, this query will be instant even with millions of records.

    This is an example of a scenario when the index on a low cardinality field is more efficient than that on a high cardinality field.

Note that if DML performance is not much on an issue, then it's safe to create the index.

If optimizer thinks that the index is inefficient, the index just will not be used.

Huygens answered 21/1, 2010 at 22:10 Comment(2)
Hmm in the second example, since you're LIMIT 1 why does it matter if the table is chronoligcally INSERTED ? the index should quickly find the first 2010 row even if was not chronoligaclly inserted no?Lobar
@JoelBlum: "if not for the index"Huygens
I
13

It might be worth including the boolean field in a composite index. For example if you have a large table of messages which typically need to be ordered by Date but you also have a boolean Deleted field, so you often query it like this:

SELECT ... FROM Messages WHERE Deleted = 0 AND Date BETWEEN @start AND @end

You will definitely benefit from having a composite index on the Deleted and Date fields.

Imbecility answered 21/1, 2010 at 21:55 Comment(7)
Thanks. Maybe I should do some research on composite indices (I just know the exist, but haven't used them much really). I'm using this column in very similar way to your sample code (though there are joins and other stuff, but the WHERE clause always have this flag for marking soft deletion).Bateau
Further discussion of why a composite index is great for this case, and why the boolean should come first: #50240158Ernst
@RickJames That's only true for a low cardinality eq index & high cardinality range index though. Normally, the high cardinality key part should come first -- the only reason it is swapped for where bool & range is because otherwise it can't use both keys in the composite with the between.Treasure
@Treasure - False. Here's a discussion of such: #50240158Ernst
@RickJames That's what I said: In the situation in that discussion, and in Vince Bowdren's example, the (low, high) ordering makes sense, because the (high, low) ordering can not be used by MySQL. However, when you add a composite index for a query with multiple equality comparisons (where x =, instead of where x between), (high, low) is usually expected to perform better.Treasure
@RickJames If you reason about it: If you wanted to find all European monarchs/rulers between 0 and 1800CE (a long range of dates), you'd probably just get a list of all European monarchs EVER, and then discard the few which disqualify by their ruling period. On the other hand, if you want a list of European monarchs in power exactly on January the 1st 1500 (one exact date), you'd probably just get ALL ruling monarchs by date first, then discard the non-european ones.Treasure
@Treasure - INDEX(continent, date) and WHERE continent='Europe' AND date BETWEEN... need touch only rows (in the index's BTree) that are relevant. That is, nothing to 'discard', regardless of the date range.Ernst
D
4

When half of the records' values will be 1 and the other half 0, no point of putting an index on that column. The query optimizer is likely not to make use of it.

Typically, however, you have a small set of "active" records and an increasingly larger set of "inactive". For example in a bug tracking system, you care about active bugs and hardly every look at the completed and archived ones. For such a case, the trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value. The query would have

"... AND dateInactivated is NULL ..." 

as part of the predicate and the index will pull in just the right set of rows that you care about.

Diacritical answered 10/7, 2019 at 5:23 Comment(0)
F
3

I usually do a simple "have index" vs "don't have" index test. In my experience you get most of the performance on queries that use ORDER BY the indexed column. In case you have any sorting on that column, indexing will most likely help.

Foucquet answered 21/1, 2010 at 21:50 Comment(1)
Thanks for your answer. In this case, I'm not sorting on that column though. It's only there to mark a record as enabled / disabled. I'm using it for soft deletion, basically. That's why I have to use it in the WHERE clause of most queries.Bateau
P
3

IMHO it's of limited usefulness. I assume in most cases there is other criteria you're using in your queries in addition to the flag that probably help out a lot more.

At 50%, I'd probably do some benchmarking with/without and see if it makes much difference.

Peridotite answered 21/1, 2010 at 21:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.