How to calculate the diskspace used by MySQL table from column datatypes and number of rows?
I have a MySQL table with more than 3 billion rows hosted on Google Cloud SQL.

I wish to understand how the total size on disk can be explained from the column datatypes, the number of rows, and the indexes.

I was hoping that it would be something like

size_of_table_in_bytes = num_rows * [ Sum over i {bytes_for_datatype_of_column(i)}
                                    + Sum over j {bytes_for_index(j)} ]
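That hoped-for model can be written as a small sketch. To be clear, this is the naive formula only; it deliberately ignores InnoDB's per-row and per-page overhead, which turns out to matter:

```python
# Naive size model: per-row bytes from datatypes plus per-row index entries.
# Ignores InnoDB record headers, page fill factor, and other overhead.
def naive_table_size(num_rows, column_bytes, index_bytes):
    """column_bytes / index_bytes: iterables of per-row byte costs."""
    return num_rows * (sum(column_bytes) + sum(index_bytes))

# Hypothetical example: 10 rows, three 4-byte columns, one 6-byte index entry
print(naive_table_size(10, [4, 4, 4], [6]))  # 180
```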

But the size I compute this way does not match the disk size my database reports.

Using the bytes per datatype listed at

https://dev.mysql.com/doc/refman/5.7/en/storage-requirements.html

and the additional bytes for InnoDB headers and indexes from

https://dev.mysql.com/doc/refman/5.7/en/innodb-physical-record.html#innodb-compact-row-format-characteristics

here is my understanding of the bytes occupied by the header, each column, and each index:

TABLE `depth` (
Bytes| Column/Header/Index
    2| variable-length header, Ceil(num columns/8) = Ceil(10/8)
    5| fixed-length header
    6| hidden DB_ROW_ID (no PRIMARY KEY is declared, so InnoDB adds one)
    3|`date` date DEFAULT NULL,
    7|`receive_time` datetime(3) DEFAULT NULL,
    8|`instrument_token` bigint(20) unsigned DEFAULT NULL,
    1|`level` tinyint(3) unsigned DEFAULT NULL,
    2|`bid_count` smallint(5) unsigned DEFAULT NULL,
    8|`bid_size` bigint(20) unsigned DEFAULT NULL,
    4|`bid_price` float DEFAULT NULL,
    4|`ask_price` float DEFAULT NULL,
    8|`ask_size` bigint(20) unsigned DEFAULT NULL,
    2|`ask_count` smallint(5) unsigned DEFAULT NULL,
    6|KEY `date_time_sym` (`date`,`receive_time`,`instrument_token`),
    6|KEY `date_sym_time` (`date`,`instrument_token`,`receive_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8

which totals to 72 bytes.
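As a quick sanity check, the contributions can be summed in a few lines. Note the 6-byte hidden DB_ROW_ID is my assumption: the table declares no PRIMARY KEY, so InnoDB generates one internally.

```python
# Per-row byte count for `depth`, per the MySQL 5.7 storage-requirements page.
header_bytes = 2 + 5        # variable-length header + fixed-length header
row_id = 6                  # hidden DB_ROW_ID (assumed: no PRIMARY KEY declared)
column_bytes = {
    "date": 3, "receive_time": 7, "instrument_token": 8, "level": 1,
    "bid_count": 2, "bid_size": 8, "bid_price": 4,
    "ask_price": 4, "ask_size": 8, "ask_count": 2,
}
index_bytes = 6 + 6         # the two KEYs, at 6 bytes each as counted above
total = header_bytes + row_id + sum(column_bytes.values()) + index_bytes
print(total)  # 72
```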

But as per SHOW TABLE STATUS, Avg_row_length = 79.

Question 1: Where am I getting the number of bytes per row wrong?

I am reasonably sure that there are no nulls in my data.

Assuming I am making some mistake in counting bytes, I take 79 bytes per row and the row count from SELECT COUNT(*), which is 3,017,513,240:

size_of_table_in_bytes = 79*3,017,513,240 = 238,383,545,960
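The arithmetic, spelled out:

```python
rows = 3_017_513_240            # from SELECT COUNT(*)
avg_row_length = 79             # bytes per row, from SHOW TABLE STATUS
size = rows * avg_row_length
print(size)                     # 238383545960
print(round(size / 10**9))      # ~238 GB
```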

Another way to get the size is the MySQL query

SHOW TABLE STATUS FROM mktdata WHERE Name = 'depth';

This returns a table with one row; the values of a few important fields are:

Name: depth
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 1,720,821,447
Avg_row_length: 78
Index_length: 183,900,307,456
Data_length: 135,245,332,480

At first I was alarmed that Rows is 1.72 billion instead of 3.01 billion, but I found this in the documentation:

  • Rows

The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40 to 50%. In such cases, use SELECT COUNT(*) to obtain an accurate count.

So, 3.01 Billion seems right for number of rows, and therefore I expect table size to be 238 GB.

But then, if I add up Data_length and Index_length, I get 319,145,639,936.
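The same arithmetic for the SHOW TABLE STATUS figures, including the bytes per row implied by the accurate COUNT(*):

```python
data_length  = 135_245_332_480
index_length = 183_900_307_456
total = data_length + index_length
print(total)                       # 319145639936, ~319 GB

actual_rows = 3_017_513_240        # from SELECT COUNT(*)
print(round(total / actual_rows))  # ~106 bytes/row, data plus indexes
```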

Question 2: Why do I get 319 GB by one method and 238 GB by the other? Which one is right, if any?

Moreover, the overall size shown for the SQL database by the Google Cloud SQL Console is 742 GB. The only other table I have, trade, has exactly 1/5th the number of rows of depth and 5 columns. Its size, summing Data_length and Index_length, is 57 GB. Adding both table sizes gives 376 GB.

Question 3: 742 GB is roughly twice 376 GB (752, to be exact). Could this be because of backups? I know Google Cloud SQL does an automatic backup once a day.

Because Question 3 seems plausible, I now doubt my simple method of size = num_rows * num_bytes_per_row. This is really troubling me, and I will appreciate any help in resolving it.

Adamite answered 14/12, 2017 at 8:31 Comment(5)
You are using MySQL, not MS SQL Server; please remove the SQL-Server tag. – Transmogrify
Do you have binary logs enabled? If yes, you have to account for their size as well. – Anonym
Thanks, I just checked: log_bin = ON. How do I check their size? – Adamite
SHOW BINARY LOGS; – Violaviolable
I get a table with Log_name and File_size as columns. The sum of File_size across 872 rows is 85,661,147,002. This explains 85 GB of disk space? – Adamite
  • There is more overhead than you mentioned; 20 bytes/row might be close.
  • Don't trust SHOW TABLE STATUS for "Rows"; use SELECT COUNT(*). Notice how it was off by nearly a factor of 2.
  • Compute the other way: 135,245,332,480 / 3,017,513,240 = 45 bytes per row.
  • From 45 bytes, I deduce that a lot of the cells are NULL?
  • Each column in each row has 1 or 2 bytes of overhead.
  • The ROW_FORMAT matters.
  • TEXT and BLOB (etc.) have radically different rules than simple datatypes.
  • The indexes take a lot more than the 6 bytes you mentioned (see your other posting).
  • The BTree structure has some overhead. When loaded in order, 15/16 of each block is filled (that is mentioned somewhere in the docs). After churn, the fill can easily range from 50% to 100%; a BTree gravitates toward 69% full (hence the 1.45 in the other posting).
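Putting the last two bullets together, a back-of-the-envelope estimate of the secondary-index size is possible. My assumptions: each index entry stores the three key columns, the 6-byte hidden row id as the clustered-index pointer, roughly 5 bytes of per-entry overhead, and pages about 15/16 full from an ordered load.

```python
rows = 3_017_513_240         # from SELECT COUNT(*)
key_columns = 3 + 7 + 8      # date + datetime(3) + bigint, per index entry
row_pointer = 6              # hidden DB_ROW_ID (assumed clustered key)
record_overhead = 5          # per-entry overhead (rough assumption)
fill_factor = 15 / 16        # blocks loaded in order

per_index = rows * (key_columns + row_pointer + record_overhead) / fill_factor
print(round(2 * per_index / 10**9))  # ~187 GB for both KEYs
```

That lands close to the reported Index_length of about 184 GB, which suggests the per-entry overhead dominates the naive 6-bytes-per-index figure.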

Reserving an equal amount of space for backup...

  • I don't know if that is what they are doing.
  • If they use mysqldump (or similar), it is not a safe formula -- the text dump of the database could be significantly larger or smaller.
  • If they use LVM, then they have room for a full binary dump. But that does not make sense because of COW (copy-on-write).
  • (So, I give up on Q3.)

Could the Cloud service be doing some kind of compression?

Violaviolable answered 15/12, 2017 at 19:48 Comment(4)
Disk usage does not include backups (they are stored separately). The rest of the disk usage likely comes from binary logs, as I mentioned in a previous comment. – Anonym
ROW_FORMAT=Dynamic. There are no TEXT or BLOB columns. Yes, the index thing I realized later. There can't be so many NULLs; I'll count. – Adamite
Vadim, Google Cloud provides a visual interface, where it is showing 750+ GB in my database. I'll need to ask them if it includes backups. Thanks for your log comment; 85 GB is explained. – Adamite
I work on Google Cloud. As I mentioned in my first comment, the disk usage displayed does not include anything related to backups. – Anonym

© 2022 - 2024 — McMap. All rights reserved.