optimize mysql count query
Asked Answered
P

10

16

Is there a way to optimize this further or should I just be satisfied that it takes 9 seconds to count 11M rows ?

devuser@xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "desc record_updates"                                                                    
+--------------+----------+------+-----+---------+-------+
| Field        | Type     | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+-------+
| record_id    | int(11)  | YES  | MUL | NULL    |       | 
| date_updated | datetime | YES  | MUL | NULL    |       | 
+--------------+----------+------+-----+---------+-------+
devuser@xcmst > date; mysql --user=user --password=pass -D marctoxctransformation -e "select count(*) from record_updates where date_updated > '2009-10-11 15:33:22' "; date                         
Thu Dec  9 11:13:17 EST 2010
+----------+
| count(*) |
+----------+
| 11772117 | 
+----------+
Thu Dec  9 11:13:26 EST 2010
devuser@xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "explain select count(*) from record_updates where date_updated > '2009-10-11 15:33:22' "      
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
| id | select_type | table          | type  | possible_keys                                          | key                                                    | key_len | ref  | rows     | Extra                    |
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | record_updates | index | idx_marctoxctransformation_record_updates_date_updated | idx_marctoxctransformation_record_updates_date_updated | 9       | NULL | 11772117 | Using where; Using index | 
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
devuser@xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "show keys from record_updates"
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| Table          | Non_unique | Key_name                                               | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| record_updates |          1 | idx_marctoxctransformation_record_updates_date_updated |            1 | date_updated | A         |        2416 |     NULL | NULL   | YES  | BTREE      |         | 
| record_updates |          1 | idx_marctoxctransformation_record_updates_record_id    |            1 | record_id    | A         |    11772117 |     NULL | NULL   | YES  | BTREE      |         | 
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
Pericline answered 9/12, 2010 at 16:16 Comment(5)
what's the engine - MyISAM or InnoDB?Life
Granted, you have 11 million rows in the table, but on average, how many entries are actually being updated in any given day? OR, is the 11 million rows, the number that ARE being updated. If 11 million updates per day, I would go with a summary table.Airlee
engine is myisam. summary table won't work for this - data is changing and queries could be anything (by the second)Pericline
Counting 1,222,222.2 rows per second. Perhaps I'm just a poor, disadvantaged webpage scripter, but I'd take those results in a heartbeat.Philippines
Can you share some context for this query to us? It seems like you are looking for too big an improvement to solve via simple "tuning" techniques. You need to exploit the table's usage patterns in order to go sub-second with this.Kitchenmaid
A
29

If mysql has to count 11M rows, there really isn't much of a way to speed up a simple count. At least not to get it to a sub 1 second speed. You should rethink how you do your count. A few ideas:

  1. Add an auto increment field to the table. It looks you wouldn't delete from the table, so you can use simple math to find the record count. Select the min auto increment number for the initial earlier date and the max for the latter date and subtract one from the other to get the record count. For example:

    SELECT min(incr_id) min_id FROM record_updates WHERE date_updated BETWEEN '2009-10-11 15:33:22' AND '2009-10-12 23:59:59';
    SELECT max(incr_id) max_id FROM record_updates WHERE date_updated > DATE_SUB(NOW(), INTERVAL 2 DAY);`
    
  2. Create another table summarizing the record count for each day. Then you can query that table for the total records. There would only be 365 records for each year. If you need to get down to more fine grained times, query the summary table for full days and the current table for just the record count for the start and end days. Then add them all together.

If the data isn't changing, which it doesn't seem like it is, then summary tables will be easy to maintain and update. They will significantly speed things up.

Assist answered 12/12, 2010 at 3:12 Comment(6)
+1 for summary table suggestion. In this case, you may choose to denormalize some smaller amount of information that can be used to generate the number you need. Just be very careful on proper maintenance of the redundant dataCrosier
summary is ok, auto increment is not - it is not guaranteed to be consecutiveLife
sorry it's taken me a while to respond - I've been out of town. Those are good suggestions, but won't work for me as the data is updated and the queries could be any data to any other data (by the second). I'm currently investigating loading the index into memory.Pericline
summary tables only work if the data set itself and desired set of fields to calculate against never changes. And as everyone knows, requirements always change.Philippines
@anderson can you update the summaries when you update the base data? The 'by the second' queries are no problem if you can keep the summaries in sync (see my answer below)Raceway
+1 for doing the simple math for tables that don't expect removing any rows.Obla
S
7

Since >'2009-10-11 15:33:22' contains most of the records,
I would suggest to do a reverse matching like <'2009-10-11 15:33:22' (mysql work less harder and less rows involved)

select 
  TABLE_ROWS -
  (select count(*) from record_updates where add_date<"2009-10-11 15:33:22") 
from information_schema.tables 
where table_schema = "marctoxctransformation" and table_name="record_updates"

You can combine with programming language (like bash shell)
to make this calculation a bit smarter...
such as do execution plan first to calculate which comparison will use lesser row

From my testing (around 10M records), the normal comparison takes around 3s,
and now cut-down to around 0.25s

Slapup answered 17/12, 2010 at 1:44 Comment(0)
M
5

MySQL doesn't "optimize" count(*) queries in InnoDB because of versioning. Every item in the index has to be iterated over and checked to make sure that the version is correct for display (e.g., not an open commit). Since any of your data can be modified across the database, ranged selects and caching won't work. However, you possibly can get by using triggers. There are two methods to this madness.

This first method risks slowing down your transactions since none of them can truly run in parallel: use after insert and after delete triggers to increment / decrement a counter table. Second trick: use those insert / delete triggers to call a stored procedure which feeds into an external program which similarly adjusts values up and down, or acts upon a non-transactional table. Beware that in the event of a rollback, this will result in inaccurate numbers.

If you don't need an exact numbers, check out this query:

select table_rows from information_schema.tables
where table_name = 'foo';

Example difference: count(*): 1876668, table_rows: 1899004. The table_rows value is an estimation, and you'll get a different number every time even if you database doesn't change.

For my own curiosity: do you need exact numbers that are updated every second? IF so, why?

Mohamed answered 17/12, 2010 at 14:46 Comment(0)
R
3

If the historical data is not volatile, create a summary table. There are various approaches, the one to choose will depend on how your table is updated, and how often.

For example, assuming old data is rarely/never changed, but recent data is, create a monthly summary table, populated for the previous month at the end of each month (eg insert January's count at the end of February). Once you have your summary table, you can add up the full months and the part months at the beginning and end of the range:

select count(*) 
from record_updates 
where date_updated >= '2009-10-11 15:33:22' and date_updated < '2009-11-01';

select count(*) 
from record_updates 
where date_updated >= '2010-12-00';

select sum(row_count) 
from record_updates_summary 
where date_updated >= '2009-11-01' and date_updated < '2010-12-00';

I've left it split out above for clarity but you can do this in one query:

select ( select count(*)
         from record_updates 
         where date_updated >= '2010-12-00'
               or ( date_updated>='2009-10-11 15:33:22' 
                    and date_updated < '2009-11-01' ) ) +
       ( select count(*) 
         from record_updates 
         where date_updated >= '2010-12-00' );

You can adapt this approach for make the summary table based on whole weeks or whole days.

Raceway answered 13/12, 2010 at 14:40 Comment(0)
S
2

You should add an index on the 'date_updated' field.

Another thing you can do if you don't mind changing the structure of the table, is to use the timestamp of the date in 'int' instead of 'datetime' format, and it might be even faster. If you decide to do so, the query will be

select count(date_updated) from record_updates where date_updated > 1291911807
Semester answered 9/12, 2010 at 16:20 Comment(5)
From what I see there is already an index on date_updated field.Archery
yes, there's already an index on that column... or are you suggesting I should change something about the index?Pericline
in re: to the second point of changing the datetime to an int... I have a similar issue on another table/column that is an int and it doesn't perform much faster.Pericline
-1 for suggesting a datetime should be stored as anything other than a datetime field even if it is faster. Huge cost, tiny (if any) benefit.Raceway
so I tried it (8 bytes to 4 bytes). It went from 9 seconds to 7 seconds. Not a big enough improvement for me to add the complexity.Pericline
A
1

There is no primary key in your table. It's possible that in this case it always scans the whole table. Having a primary key is never a bad idea.

Archery answered 9/12, 2010 at 16:42 Comment(0)
K
1

If you need to return the total table's row count, then there is an alternative to the SELECT COUNT(*) statement which you can use. SELECT COUNT(*) makes a full table scan to return the total table's row count, so it can take a long time. You can use the sysindexes system table instead in this case. There is a ROWS column in the sysindexes table. This column contains the total row count for each table in your database. So, you can use the following select statement instead of SELECT COUNT(*):

SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2

This can improve the speed of your query.

EDIT: I have discovered that my answer would be correct if you were using a SQL Server database. MySQL databases do not have a sysindexes table.

Kemerovo answered 12/12, 2010 at 3:24 Comment(6)
@zedo How would this handle where date_updated > '2009-10-11 15:33:22' even on SQL Server?Raceway
I would just add another AND to the WHERE clause. However, I am not certain that the timestamp in sysindexes would coincide with the actual table you are looking at.Kemerovo
@zedo how could it? There is only 1 row in sysindexes for each index. Also the value of the rows field is an estimateRaceway
The value of the rows field is not merely an estimate depending upon the last time STATISTICS were updated for the db. If the DBCC SHOW_STATISTICS (table_name , index_name) command indicates that it is outdated then one may run the following command to update statistics: USE <database_name> EXEC sp_updatestatsKemerovo
@zedo it still should be considered an estimate unless you are the only user on that database - but this is an aside really as sysindexes can only give you an estimate of the total number of rows in the table, which is not what anderson wantsRaceway
if you are the DBA making sure this info is current or if you have auto-update enabled on the sysindexes table then it should not be considered an estimate. my edit from the other day already indicated that I realized that andersonbd1's MySQL db does not have a sysindexes table.Kemerovo
W
1

It depends on a few things but something like this may work for you

im assuming this count never changes as it is in the past so the result can be cached somehow

count1 = "select count(*) from record_updates where date_updated <= '2009-10-11 15:33:22'"

gives you the total count of records in the table, this is an approximate value in innodb table so BEWARE, depends on engine

count2 = "select table_rows from information_schema.`TABLES` where table_schema = 'marctoxctransformation' and TABLE_NAME = 'record_updates'"

your answer

result = count2 - count1

Wedded answered 14/12, 2010 at 14:43 Comment(0)
L
1

There are a few details I'd like you to clarify (would put into comments on the q, but it is actually easier to remove from here when you update your question).

  1. What is the intended usage of data, insert once and get the counts many times, or your inserts and selects are approx on par?
  2. Do you care about insert/update performance?
  3. What is the engine used for the table? (heck you can do SHOW CREATE TABLE ...)
  4. Do you need the counts to be exact or approximately exact (like 0.1% correct)
  5. Can you use triggers, summary tables, change schema, change RDBMS, etc.. or just add/remove indexes?
  6. Maybe you should explain also what is this table supposed to be? You have record_id with cardinality that matches the number of rows, so is it PK or FK or what is it? Also the cardinality of the date_updated suggests (though not necessarily correct) that it has same values for ~5,000 records on average), so what is that? - it is ok to ask a SQL tuning question with not context, but it is also nice to have some context - especially if redesigning is an option.

In the meantime, I'll suggest you to get this tuning script and check the recommendations it will give you (it's just a general tuning script - but it will inspect your data and stats).

Life answered 14/12, 2010 at 16:39 Comment(0)
D
0

Instead of doing count(*), try doing count(1), like this:-

select count(1) from record_updates where date_updated > '2009-10-11 15:33:22'

I took a DB2 class before, and I remember the instructor mentioned about doing a count(1) when we just want to count number of rows in the table regardless the data because it is technically faster than count(*). Let me know if it makes a difference.

NOTE: Here's a link you might be interested to read: http://www.mysqlperformanceblog.com/2007/04/10/count-vs-countcol/

Dynamo answered 12/12, 2010 at 2:34 Comment(4)
this has been proven to NOT be the case in Oracle. the performance is the same with count(*) an count(1)Crosier
@Randy: Is it the case in this situation -> MySQL as stated in OP?Radie
@Radie - I'm not sure, just trying to add some info that I know about Oracle. I do not have a My SQL instance to do the tests myself either... :(Crosier
actually, MySQL (ISAM, not InnoDB) is supposed to have optimizations for some Count(*) queries - like those with no conditions, or on a primary key. so this statement by limc should be inapplicable. feel free to test, though.Bagnio

© 2022 - 2024 — McMap. All rights reserved.