Why does the index occupy even more space than data in my MySQL table?

I have a MySQL table with more than a billion rows hosted on Google Cloud SQL.

>> SHOW CREATE TABLE depth

CREATE TABLE `depth` (
  `date` date DEFAULT NULL,
  `receive_time` datetime(3) DEFAULT NULL,
  `instrument_token` bigint(20) unsigned DEFAULT NULL,
  `level` tinyint(3) unsigned DEFAULT NULL,
  `bid_count` smallint(5) unsigned DEFAULT NULL,
  `bid_size` bigint(20) unsigned DEFAULT NULL,
  `bid_price` float DEFAULT NULL,
  `ask_price` float DEFAULT NULL,
  `ask_size` bigint(20) unsigned DEFAULT NULL,
  `ask_count` smallint(5) unsigned DEFAULT NULL,
   KEY `date_time_sym` (`date`,`receive_time`,`instrument_token`),
   KEY `date_sym_time` (`date`,`instrument_token`,`receive_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8

To get the data and index size, I run the query

SHOW TABLE STATUS FROM mktdata WHERE Name = 'depth';

This returns one row; the important fields are:

Name: depth
Engine:InnoDB
Version:10
Row_format:Dynamic
Rows: 1,720,821,447
Avg_row_length: 78
Index_length: 183,900,307,456
Data_length: 135,245,332,480
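
The same figures can also be read from information_schema, which avoids parsing the SHOW TABLE STATUS output (a sketch; the schema name mktdata is taken from the query above):

```sql
-- Byte counts for data and secondary indexes, from the data dictionary.
-- TABLE_ROWS and AVG_ROW_LENGTH are estimates, just as in SHOW TABLE STATUS.
SELECT TABLE_NAME,
       TABLE_ROWS,
       AVG_ROW_LENGTH,
       DATA_LENGTH,
       INDEX_LENGTH
FROM   information_schema.TABLES
WHERE  TABLE_SCHEMA = 'mktdata'
  AND  TABLE_NAME   = 'depth';
```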

Question: Why is Index_length larger than Data_length? You can see my indexes above; why do they need so much space to be stored? I don't know much about how indexes are created and stored, so please explain from the basics.

Dawndawna answered 15/12, 2017 at 9:45 Comment(0)

Having Index_length > Data_length is rare, but not "bad" or "wrong".

You have no explicit PRIMARY KEY, so InnoDB uses a hidden 6-byte field that works sort of like AUTO_INCREMENT.

Each secondary key contains a copy of the PK.

Avg_row_length: 78 -- That is computed as Data_length/Rows. But Rows is an estimate; I've seen it be off by a factor of 2 or more.

Every column you have declared is NULLable; is that deliberate? Shouldn't most be NOT NULL? The following computation does not take into account that the columns can be NULL.

However, if a lot of values are NULL, then maybe 78 bytes/row is valid. For example, a BIGINT normally takes 8 bytes (plus overhead), but if NULL, it takes 0 bytes (plus overhead).

One Index size:

  • 3 bytes for DATE
  • 7 bytes for DATETIME(3)
  • 8 bytes for BIGINT (the (20) is irrelevant)
  • 6 bytes for hidden PK

That totals 24 bytes per index row. Then:

  • 24 bytes
  • plus roughly 20 bytes of overhead per row
  • times 1.45 for BTree block-split overhead

That comes to about 64 bytes per index row. × 1.72 billion rows ≈ 110GB.

Double that because of 2 indexes -- 220GB. SHOW TABLE STATUS says 184GB. Those two numbers are close enough. (Some of the numbers I used are only approximate.)
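
The arithmetic above can be checked directly in MySQL (a sketch; 1,720,821,447 is the row estimate from SHOW TABLE STATUS):

```sql
-- ~64 bytes per index row: (24 key+PK bytes + ~20 bytes overhead) * 1.45 for block splits
SELECT (24 + 20) * 1.45                               AS bytes_per_index_row,  -- ~63.8
       (24 + 20) * 1.45 * 1720821447 / POW(10, 9)     AS gb_per_index,         -- ~110 GB
       2 * (24 + 20) * 1.45 * 1720821447 / POW(10, 9) AS gb_both_indexes;      -- ~220 GB
```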

It is naughty not to have an explicit PRIMARY KEY. Adding an AUTO_INCREMENT means either risking running out of room in a 4-byte INT or using a huge 8-byte BIGINT. If some combination of columns is unique, it could be the PK. Doing so may shrink the data size (by getting rid of the 6-byte hidden PK) and may shrink the indexes (if any of the 3 index columns are part of the PK).
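
For example, if (date, instrument_token, receive_time, level) happened to uniquely identify a row (an assumption; verify it against your data first), the hidden PK could be replaced:

```sql
-- Hypothetical: only valid if this column combination is unique.
-- PK columns must be NOT NULL; rebuilding a billion-row table this
-- way is a long, heavy operation.
ALTER TABLE depth
  MODIFY `date` date NOT NULL,
  MODIFY `instrument_token` bigint unsigned NOT NULL,
  MODIFY `receive_time` datetime(3) NOT NULL,
  MODIFY `level` tinyint unsigned NOT NULL,
  ADD PRIMARY KEY (`date`, `instrument_token`, `receive_time`, `level`);
```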

More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql Note what it says about "ranges" and indexes. If you use a range on date or receive_time, the indexes you have will be less-than-optimal.
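
To see why column order matters with ranges, compare how the existing indexes handle a typical lookup (a sketch; the literal values are made up):

```sql
-- Served well by date_sym_time (date, instrument_token, receive_time):
-- both equality columns come first, then the range.
SELECT *
FROM   depth
WHERE  date = '2017-12-15'
  AND  instrument_token = 12345
  AND  receive_time BETWEEN '2017-12-15 09:15:00.000'
                        AND '2017-12-15 09:20:00.000';

-- Served poorly by date_time_sym (date, receive_time, instrument_token):
-- the range on receive_time stops index use before instrument_token.
```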

Burger answered 15/12, 2017 at 17:40 Comment(2)
Thanks, nullable is deliberate, but currently there are almost no NULLs in my dataset. A primary key is something I may have to add. I'll read the link.Dawndawna
can you please also look at a related question on the same dataset I asked? #47809454Dawndawna

That can happen.

You have a heavily indexed table. This may or may not be useful.

Here are some common mistakes:

  • "I am indexing all the columns" -- usually not useful.

  • "I indexed every column I use" -- but failed to understand the importance of "compound" indexes: INDEX(last, first) versus INDEX(last), INDEX(first)

  • INDEX(a), INDEX(a,b) -- not realizing that the first one is redundant.

  • PRIMARY KEY(id), INDEX(id) -- not realizing that PRIMARY KEY is an INDEX (and UNIQUE).
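
The redundant-index mistake can be seen in miniature (hypothetical table; any lookup on last alone can use the leftmost column of the compound index instead):

```sql
-- INDEX(last) is redundant: INDEX(last, first) already covers any
-- lookup on last alone, because last is its leftmost column.
CREATE TABLE people (
  id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  last  VARCHAR(50) NOT NULL,
  first VARCHAR(50) NOT NULL,
  KEY last_only  (last),          -- redundant; safe to drop
  KEY last_first (last, first)
);

ALTER TABLE people DROP INDEX last_only;
```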

Post SHOW CREATE TABLE and describe the main SELECTs. Then we can discuss which INDEXes would be optimal, and which can be dropped.

A further note: An INDEX contains a row for each row in the table, and contains the indexed columns, plus a pointer to the row in the DATA. If you have multiple indexes, they are all included in INDEX_LENGTH (except the PRIMARY KEY for InnoDB).

Or, it may be that you have the optimal set of indexes, and the index size is simply bigger than the data.

Ref Link: https://forums.mysql.com/read.php?10,390235,390352

Grapple answered 15/12, 2017 at 9:58 Comment(1)
bytes per row in each index = 3 index columns + 1 pointer = 3(date) + 7(datetime) + 8(bigint) + 8 = 26 bytes. Two indexes = 52 bytes. Original table has 78 bytes per row. Close, but something is amiss.Dawndawna

My table had a clustered columnstore index that compressed the table. When I added a nonclustered index covering most columns, the nonclustered index was 10 times bigger than the columnstore-compressed table.

Unlisted answered 20/2, 2024 at 10:27 Comment(0)
