We have a table in MySql with arround 30 million records, the following is table structure
CREATE TABLE `campaign_logs` (
`domain` varchar(50) DEFAULT NULL,
`campaign_id` varchar(50) DEFAULT NULL,
`subscriber_id` varchar(50) DEFAULT NULL,
`message` varchar(21000) DEFAULT NULL,
`log_time` datetime DEFAULT NULL,
`log_type` varchar(50) DEFAULT NULL,
`level` varchar(50) DEFAULT NULL,
`campaign_name` varchar(500) DEFAULT NULL,
KEY `subscriber_id_index` (`subscriber_id`),
KEY `log_type_index` (`log_type`),
KEY `log_time_index` (`log_time`),
KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
Following is my query
I'm doing UNION ALL instead of using IN operation
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
count(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_OPENED'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date
UNION ALL
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
COUNT(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_SENT'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date
UNION ALL
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
COUNT(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_CLICKED'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date,
Following is my Explain statement
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| 1 | PRIMARY | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 55074 | Using where; Using index; Using filesort |
| 2 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 330578 | Using where; Using index; Using filesort |
| 3 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
- I changed COUNT(subscriber_id) to COUNT(*) and observed no performance gain.
2.I removed COUNT(DISTINCT subscriber_id) from the query , then I got huge performance gain , I'm getting results in approx 1.5 sec, previously it was taking 50 sec - 1 minute. But I need distinct count of subscriber_id from the query
Following is explain when I remove COUNT(DISTINCT subscriber_id) from the query
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 55074 | Using where; Using index; Using temporary; Using filesort |
| 2 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
| 3 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
- I ran three queries individually by removing UNION ALL. ONe query took 32seconds , others are taking 1.5 seconds each, but first query is dealing with around 350K records and others are dealing with only 2k rows
I could solve my performance problem by leaving out COUNT(DISTINCT...)
but I need those values. Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...)
values, but much faster?
UPDATE the following information is about data distribution of above table
for 1 domain 1 campaign 20 log_types 1k-200k subscribers
The above query I'm running for , the domain having 180k+ subscribers.
AND (log_type = 'EMAIL_OPENED' OR log_type = 'EMAIL_SENT' OR log_type = 'EMAIL_CLICKED')
– SnoodORDER BY NULL
after eachGROUP BY
this might get rid of the filesort. – BelshazzarEXPLAIN
makes it clear that your compound index is being used as you intend. Some things to try: 1) changeCOUNT(subscriber_id)
toCOUNT(*)
and see if performance improves. 2) try getting rid ofCOUNT(DISTINCT subscriber_id)
and see if performance improves. Run each of the three subqueries you're combining withUNION ALL
and see if one of them has poorer performance than the other two. Please update your question with results of those tests. – NettieGROUP BY HOUR( CONVERT_TZ(log_time,'+00:00','+05:30'))
in place of your log_date result set column formatted%l %p
. Numeric grouping may make it easier to process. – NettieDISTINCT
operation slowing things down. – NettieDISTINCT
: toGROUP
them engine sorts 350K rows by the result of theDATE_FORMAT
function and then steps through the result of the sort and counts rows in any order they appear. When you addDISTINCT
the engine has to sort again within each group. Kind of nested sort. Apparently, this is not handled efficiently. – SheebCOUNT(DISTINCT subscriber_id)
being off by about 1%? If so, (and you fix theBETWEEN
), I can explain how to revise the query (and other things) to speed it up by at least 100-fold. – StoutheartedIN
(so nounion
s) and remove thecount(distinct)
? – Creasy