MySQL: Dependent subquery with NOT IN in the WHERE clause is very slow


I am auditing user logins in my application, which uses OpenID login. The first time a user logs in with an OpenID provider, we treat it as a sign-up. I generate a sign-in audit report from these details. Sample table data:

+---------+----------+-----------+---------------+
| USER_ID | PROVIDER | OPERATION | TIMESTAMP     |
+---------+----------+-----------+---------------+
|     120 | Google   | SIGN_UP   | 1347296347000 |
|     120 | Google   | SIGN_IN   | 1347296347000 |
|     121 | Yahoo    | SIGN_IN   | 1347296347000 |
|     122 | Yahoo    | SIGN_IN   | 1347296347000 |
|     120 | Google   | SIGN_UP   | 1347296347000 |
|     120 | FaceBook | SIGN_IN   | 1347296347000 |
+---------+----------+-----------+---------------+

From this table I want the number of users with a SIGN_IN row per provider, excluding users who already have a SIGN_UP row for that provider.

SHOW CREATE TABLE output:

CREATE TABLE `signin_details` (
  `USER_ID` int(11) DEFAULT NULL,
  `PROVIDER` char(40) DEFAULT NULL,
  `OPERATION` char(40) DEFAULT NULL,
  `TIMESTAMP` bigint(20) DEFAULT NULL
) ENGINE=InnoDB

I am using this query:

select 
  count(distinct(USER_ID)) as signin_count, 
  PROVIDER from signin_details s1 
where 
  s1.USER_ID NOT IN 
  (
    select 
      USER_ID 
    from signin_details 
    where 
      signin_details.PROVIDER=s1.PROVIDER 
      and signin_details.OPERATION='SIGN_UP' 
      and signin_details.TIMESTAMP BETWEEN UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY) * 1000 AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
  )  
  AND OPERATION='SIGN_IN' group by PROVIDER;

Explain Output:

+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type        | table          | type | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
|  1 | PRIMARY            | s1             | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using where; Using filesort |
|  2 | DEPENDENT SUBQUERY | signin_details | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using where                 |
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+

Query Output :

+--------------+----------+
| signin_count | PROVIDER |
+--------------+----------+
|            1 | FaceBook |
|            2 | Yahoo    |
+--------------+----------+

It takes more than 40 minutes to execute for 200k rows.

My assumption is that each outer row is checked against every row returned by the dependent subquery:

 A -> dependent subquery output (B, C, D)
 A is checked against B
 A is checked against C
 A is checked against D

If the dependent subquery output is large, the query takes a very long time to execute. How can I improve this query?


Solution

  • If you use MySQL, you have to know that dependent subqueries like this one can be extremely slow, because the subquery is evaluated again for each outer row.

    IN is slow.

    EXISTS is often faster than IN.

    A JOIN is usually the fastest way to do things like this:

    SELECT
      s1.PROVIDER,
      COUNT(DISTINCT s1.USER_ID)
    
    FROM 
      signin_details s1
      LEFT JOIN 
      (
        -- every (USER_ID, PROVIDER) pair that signed up in the time window
        SELECT DISTINCT
          USER_ID, PROVIDER
        FROM 
          signin_details 
        WHERE
          signin_details.OPERATION='SIGN_UP' 
          AND 
            signin_details.TIMESTAMP 
              BETWEEN 
                UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY) * 1000 
                AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
      ) AS t USING  (USER_ID, PROVIDER)
    
    WHERE
      -- keep only sign-ins without a matching sign-up
      t.USER_ID IS NULL
      AND s1.OPERATION='SIGN_IN'
    GROUP BY s1.PROVIDER
    

    http://sqlfiddle.com/#!2/122ac/12

    NOTE: If the sqlfiddle result looks different, keep in mind that the query uses UNIX_TIMESTAMP, so the result depends on the date on which it is run.

    Result:

    | PROVIDER | COUNT(DISTINCT S1.USER_ID) |
    -----------------------------------------
    | FaceBook |                          1 |
    |    Yahoo |                          2 |
    

    MySQL and the INTERSECT story: this is the same kind of workaround used to emulate set operations with joins. The derived table t contains all combinations of USER_ID and PROVIDER that you do not want to count. These are LEFT JOINed to your data, so every row you do want to count gets no values from t. You select those rows with t.USER_ID IS NULL.
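
    To make the LEFT JOIN step visible, it may help to look at the joined rows before they are filtered and grouped. The query below is only a sketch for inspection and is not part of the query above: the alias counted is made up for illustration, and the join is written with ON instead of USING so that both sides stay explicit.

    SELECT
      s1.USER_ID,
      s1.PROVIDER,
      -- 1 = no matching SIGN_UP was found, so this row would be counted
      t.USER_ID IS NULL AS counted
    FROM
      signin_details s1
      LEFT JOIN
      (
        SELECT DISTINCT
          USER_ID, PROVIDER
        FROM
          signin_details
        WHERE
          OPERATION = 'SIGN_UP'
          AND `TIMESTAMP`
            BETWEEN
              UNIX_TIMESTAMP(CURRENT_DATE() - INTERVAL 1 DAY) * 1000
              AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
      ) AS t
        ON  t.USER_ID  = s1.USER_ID
        AND t.PROVIDER = s1.PROVIDER
    WHERE
      s1.OPERATION = 'SIGN_IN';

    Rows where counted is 1 are exactly the ones that survive the t.USER_ID IS NULL condition.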


    Input:

    | rn° | USER_ID | PROVIDER | OPERATION |     TIMESTAMP | counted
    -----------------------------------------------------------------
    | 1   |     120 |   Google |   SIGN_UP | 1347296347000 | -
    | 2   |     120 |   Google |   SIGN_IN | 1347296347000 | - (see rn° 1)
    | 3   |     121 |    Yahoo |   SIGN_IN | 1347296347000 | Y
    | 4   |     122 |    Yahoo |   SIGN_IN | 1347296347000 | Y
    | 5   |     120 |   Google |   SIGN_UP | 1347296347000 | -
    | 6   |     120 | FaceBook |   SIGN_IN | 1347296347000 | F
    | 7   |     119 | FaceBook |   SIGN_IN | 1347296347000 | - (see rn° 8)
    | 8   |     119 | FaceBook |   SIGN_UP | 1347296347000 | -
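
    For comparison, the EXISTS form mentioned at the top of this answer could be written with NOT EXISTS roughly as follows. This is a sketch under assumptions and has not been run against the asker's data. The index name idx_signin_lookup and its column order are hypothetical; they are only suggested because the EXPLAIN output in the question shows possible_keys = NULL for both tables, so every lookup is a full scan.

    -- Assumed helper index (hypothetical name) so the correlated lookup
    -- does not have to scan the whole table for every outer row.
    ALTER TABLE signin_details
      ADD INDEX idx_signin_lookup (USER_ID, PROVIDER, OPERATION, `TIMESTAMP`);

    SELECT
      s1.PROVIDER,
      COUNT(DISTINCT s1.USER_ID) AS signin_count
    FROM
      signin_details s1
    WHERE
      s1.OPERATION = 'SIGN_IN'
      AND NOT EXISTS
      (
        -- is there a SIGN_UP for the same user and provider in the window?
        SELECT 1
        FROM signin_details s2
        WHERE
          s2.USER_ID = s1.USER_ID
          AND s2.PROVIDER = s1.PROVIDER
          AND s2.OPERATION = 'SIGN_UP'
          AND s2.`TIMESTAMP`
            BETWEEN
              UNIX_TIMESTAMP(CURRENT_DATE() - INTERVAL 1 DAY) * 1000
              AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
      )
    GROUP BY s1.PROVIDER;

    One difference worth knowing: NOT IN behaves differently from NOT EXISTS and from the LEFT JOIN when the subquery can return NULL. If the subquery yields a NULL USER_ID, a NOT IN condition is never true for rows that have no match. USER_ID is declared DEFAULT NULL here, so the forms only give the same result if that column never actually contains NULL.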