MySQL not picking correct row count from index

I have a following table

CREATE TABLE `test_series_analysis_data` (
  `email` varchar(255) NOT NULL,
  `mappingId` int(11) NOT NULL,
  `packageId` varchar(255) NOT NULL,
  `sectionName` varchar(255) NOT NULL,
  `createdAt` datetime(3) DEFAULT NULL,
  `marksObtained` float NOT NULL,
  `updatedAt` datetime DEFAULT NULL,
  `testMetaData` longtext,
  PRIMARY KEY (`email`,`mappingId`,`packageId`,`sectionName`),
  KEY `rank_index` (`mappingId`,`packageId`,`sectionName`,`marksObtained`),
  KEY `mapping_package` (`mappingId`,`packageId`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

Following is the output of the explain for the queries:

explain select rank 
from (
   select email, @i:=@i+1 as rank 
   from test_series_analysis_data ta 
   join (select @i:=0) va 
   where mappingId = ?1 
   and packageId = ?2 
   and sectionName = ?3 
   order by marksObtained desc
) as inter 
where inter.email = ?4;

+----+-------------+------------+------------+--------+----------------------------+-------------+---------+-------+-------+----------+--------------------------+
| id | select_type | table      | partitions | type   | possible_keys              | key         | key_len | ref   | rows  | filtered | Extra                    |
+----+-------------+------------+------------+--------+----------------------------+-------------+---------+-------+-------+----------+--------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ref    | <auto_key0>                | <auto_key0> | 767     | const |    10 |   100.00 | NULL                     |
|  2 | DERIVED     | <derived3> | NULL       | system | NULL                       | NULL        | NULL    | NULL  |     1 |   100.00 | Using filesort           |
|  2 | DERIVED     | ta         | NULL       | ref    | rank_index,mapping_package | rank_index  | 4       | const | 20160 |     1.00 | Using where; Using index |
|  3 | DERIVED     | NULL       | NULL       | NULL   | NULL                       | NULL        | NULL    | NULL  |  NULL |     NULL | No tables used           |
+----+-------------+------------+------------+--------+----------------------------+-------------+---------+-------+-------+----------+--------------------------+

Query optimizer could have used both indexes but rank_index is a covering index so it got picked. What surprises me is the output of the following query:

explain select rank 
from ( 
  select email, @i:=@i+1 as rank 
  from test_series_analysis_data ta use index (mapping_package) 
  join (select @i:=0) va 
  where mappingId = ?1 
  and packageId = ?2 
  and sectionName = ?3 
  order by marksObtained desc
) as inter 
where inter.email = ?4;

+----+-------------+------------+------------+--------+-----------------+-----------------+---------+-------+-------+----------+-----------------------+
| id | select_type | table      | partitions | type   | possible_keys   | key             | key_len | ref   | rows  | filtered | Extra                 |
+----+-------------+------------+------------+--------+-----------------+-----------------+---------+-------+-------+----------+-----------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ref    | <auto_key0>     | <auto_key0>     | 767     | const |    10 |   100.00 | NULL                  |
|  2 | DERIVED     | <derived3> | NULL       | system | NULL            | NULL            | NULL    | NULL  |     1 |   100.00 | Using filesort        |
|  2 | DERIVED     | ta         | NULL       | ref    | mapping_package | mapping_package | 4       | const | 19434 |     1.00 | Using index condition |
|  3 | DERIVED     | NULL       | NULL       | NULL   | NULL            | NULL            | NULL    | NULL  |  NULL |     NULL | No tables used        |
+----+-------------+------------+------------+--------+-----------------+-----------------+---------+-------+-------+----------+-----------------------+

Why are there rows lesser (19434<20160) when the index being used is mapping_package. rank_index can better select what is required so the row count should be lesser in rank_index.

So does this mean mapping_package index is better than rank_index for the given query?

Does it have any effect that sectionName is a varchar so both indexes should give similar performance?

Also I am assuming Using index condition is selecting only few rows from index and scanning some more. While in case Using where; Using index, optimizer has to only read the index and not table to get rows and then it is selecting some data. Then why Using where is missing while using rank_index?

Moreover why is the key_len for mapping_package is 4 when there are only two columns in the index?

Help appreciated.

(19434<20160) -- Both of those numbers are estimates. It is unusual for them to be that close. I'll bet if you did ANALYZE TABLE, both would change, possibly changing the inequality.

Notice something else: Using where; Using index versus Using index condition.

But first, let me remind you that, in InnoDB, the PRIMARY KEY columns are tacked onto the secondary key. So, effectively you have

KEY `rank_index`      (`mappingId`,`packageId`,`sectionName`,`marksObtained`,`email`)
KEY `mapping_package` (`mappingId`,`packageId`,`email`,`sectionName`)

Now let's decide what the optimal index should be: where mappingId = ?1 and packageId = ?2 and sectionName = ?3 order by marksObtained desc

First, the = parts of WHERE: mappingId, packageId, sectionName, in any order;
Then the ORDER BY column(s): marksObtained
Bonus: Finally if email (the only other column mentioned anywhere in the SELECT) is in the key, it will be "Covering".

This says that rank_index is "perfect", and the other index is not so good. Alas, EXPLAIN does not clearly say that.

You, too, could have figured this out -- all you needed is to study my blog: http://mysql.rjweb.org/doc.php/index_cookbook_mysql (Sorry; it's getting late, and I am getting cheeky.)

Other tips:

Don't blindly use (255). When a tmp table is needed, this can make the the tmp table bigger, hence less efficient. Lower the limit to something reasonable. Or...
If this is a huge table, you really ought to 'normalize' the strings, replacing them with maybe a 2-byte SMALLINT UNSIGNED. This will improve performance in other ways, such as decreasing costly I/O. (OK, 20 rows is pretty small, so this may not apply.)

Why is key_len 4? That implies that one column was used, namely the 4-byte INT mappingId. I would have expected it to use the second column, too. So, I am stumped. EXPLAIN FORMAT=JSON SELECT ... may provide more clues.

Recommended topics

Hot tags