mysqlsqlaggregate-functionsquery-performancemysql-variables

Optimizing COUNT(DISTINCT) slowness even with covering indexes


We have a table in MySql with arround 30 million records, the following is table structure

CREATE TABLE `campaign_logs` (
  `domain` varchar(50) DEFAULT NULL,
  `campaign_id` varchar(50) DEFAULT NULL,
  `subscriber_id` varchar(50) DEFAULT NULL,
  `message` varchar(21000) DEFAULT NULL,
  `log_time` datetime DEFAULT NULL,
  `log_type` varchar(50) DEFAULT NULL,
  `level` varchar(50) DEFAULT NULL,
  `campaign_name` varchar(500) DEFAULT NULL,
  KEY `subscriber_id_index` (`subscriber_id`),
  KEY `log_type_index` (`log_type`),
  KEY `log_time_index` (`log_time`),
  KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
  KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

Following is my query

I'm doing UNION ALL instead of using IN operation

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       count(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_OPENED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
            COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_SENT'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
            COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_CLICKED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date,

Following is my Explain statement

+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                    |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
|  1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |  55074 | Using where; Using index; Using filesort |
|  2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL | 330578 | Using where; Using index; Using filesort |
|  3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |   1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      | NULL    | NULL |   NULL |                                          |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
  1. I changed COUNT(subscriber_id) to COUNT(*) and observed no performance gain.

2.I removed COUNT(DISTINCT subscriber_id) from the query , then I got huge performance gain , I'm getting results in approx 1.5 sec, previously it was taking 50 sec - 1 minute. But I need distinct count of subscriber_id from the query

Following is explain when I remove COUNT(DISTINCT subscriber_id) from the query

+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                                     |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |  55074 | Using where; Using index; Using temporary; Using filesort |
|  2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
|  3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |   1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      | NULL    | NULL |   NULL |                                                           |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
  1. I ran three queries individually by removing UNION ALL. ONe query took 32seconds , others are taking 1.5 seconds each, but first query is dealing with around 350K records and others are dealing with only 2k rows

I could solve my performance problem by leaving out COUNT(DISTINCT...) but I need those values. Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?

UPDATE the following information is about data distribution of above table

for 1 domain 1 campaign 20 log_types 1k-200k subscribers

The above query I'm running for , the domain having 180k+ subscribers.


Solution

  • If the query without the count(distinct) is going much faster, perhaps you can do nested aggregation:

    SELECT log_type, log_date,
           count(*) AS COUNT, sum(cnt) AS total
    FROM (SELECT log_type,
                 DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
                 subscriber_id, count(*) as cnt
          FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
          WHERE DOMAIN = 'xxx' AND
                campaign_id = '123' AND
                log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
                log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND 
                                 CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
          GROUP BY log_type, log_date, subscriber_id
         ) l
    GROUP BY logtype, log_date;
    

    With a bit of luck, this will take 2-3 seconds rather than 50. However, you might need to break this out into subqueries, to get full performance. So, if this does not have a significant performance gain, change the in back to = one of the types. If that works, then the union all may be necessary.

    EDIT:

    Another attempt is to use variables to enumerate the values before the group by:

    SELECT log_type, log_date, count(*) as cnt,
           SUM(rn = 1) as sub_cnt
    FROM (SELECT log_type,
                 DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
                 subscriber_id,
                 (@rn := if(@clt = concat_ws(':', campaign_id, log_type, log_time), @rn + 1,
                            if(@clt := concat_ws(':', campaign_id, log_type, log_time), 1, 1)
                           )
                  ) as rn
          FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index) CROSS JOIN
               (SELECT @rn := 0)
          WHERE DOMAIN = 'xxx' AND
                campaign_id = '123' AND
                log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
                log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00', '+00:00', '+05:30') AND 
                                 CONVERT_TZ('2015-03-01 23:59:58', '+00:00', '+05:30')
          ORDER BY log_type, log_date, subscriber_id
         ) t
    GROUP BY log_type, log_date;
    

    This still requires another sort of the data, but it might help.