sqlmysqljoingroup-bynotin

Effectively executing Conditional NOT IN Subquery with JOIN and GROUP BY


I need to adjust the following select statement to effectively filter out 523 records where 'first_date_purchased' is less than 'first_date_watched'. The following select statement returns 20778 records, of which 20255 should remain.

    SELECT se.student_id,
           si.date_registered,
           MIN(se.date_watched) AS first_date_watched,
           MIN(sp.date_purchased) AS first_date_purchased
    FROM student_engagement se
      LEFT JOIN student_info si ON se.student_id = si.student_id
      LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
    GROUP BY se.student_id
    ORDER BY- first_date_purchased DESC
    -- Gives 20778 records

I have come up with the following solution, using NOT IN, but it has a substantial runtime of 1 minute. Initially I cancelled the process, but was advised by @Akina to include a sample dataset, on which both this query and another solution by @Raphi worked perfectly. My query is as follows:

    SELECT se.student_id,
           si.date_registered,
           MIN(se.date_watched) AS first_date_watched,
           MIN(sp.date_purchased) AS first_date_purchased
    FROM student_engagement se
      LEFT JOIN student_info si ON se.student_id = si.student_id
      LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
    WHERE se.student_id NOT IN (SELECT sp2.student_id
                                FROM student_purchases sp2
                                WHERE sp2.date_purchased < se.date_watched)
    GROUP BY se.student_id
    ORDER BY- first_date_purchased DESC;
    -- Gives 20255 records

Although, I wanted to complete all filtering within a single select statement, I initially cast the unfiltered result of 20778 records into a new table called 'unfiltered_true_dataset' and got the desired filtering done with the following select statement based on that:

    SELECT *
    FROM unfiltered_true_dataset
    WHERE student_id NOT IN (SELECT student_id
                             FROM unfiltered_true_dataset
                             WHERE first_date_purchased < first_date_watched)
    -- Successfully obtained 20255 records

My question is whether there might be a more effective way to apply such a filter in a single select statement? Here is a sample dataset. The unfiltered select stetement returns 5 records, where the solution using the NOT IN subquery correctly returns only 4 records:

    CREATE TABLE `student_info` 
    (`student_id` INT NOT NULL, 
    `date_registered` DATE DEFAULT NULL, 
    PRIMARY KEY (`student_id`));
    
    INSERT INTO `student_info` VALUES
    (255193, '2021-12-01'),(255194, '2021-12-01'),(255196, '2021-12-01'),
    (255198, '2021-12-01'),(255199, '2021-12-01'),(255200, '2021-12-01'),
    (255201, '2021-12-01'),(255203, '2021-12-01'),(255204, '2021-12-01'),
    (255205, '2021-12-01');
    -- Full dataset has 40979 records
    
    CREATE TABLE `student_engagement` 
    (`student_id` INT DEFAULT NULL, 
    `date_watched` DATE DEFAULT NULL);
    
    INSERT INTO `student_engagement` VALUES
    (255200, '2021-12-01'),(255201, '2021-12-02'),(255201, '2021-12-06'),
    (255201, '2021-12-08'),(255201, '2022-10-05'),(255203, '2022-03-08'),
    (255203, '2022-03-17'),(255203, '2022-03-23'),(255203, '2021-12-01'),
    (255203, '2022-01-25'),(255203, '2022-01-26'),(255203, '2022-01-27'),
    (255203, '2022-02-08'),(255203, '2022-02-09'),(255203, '2022-02-21'),
    (255203, '2022-02-22'),(255203, '2022-02-23'),(255204, '2021-12-01'),
    (255204, '2021-12-02'),(255204, '2021-12-03'),(255204, '2021-12-14'),
    (255204, '2021-12-17'),(255204, '2021-12-07'),(255204, '2021-12-05'),
    (255204, '2021-12-09'),(255204, '2021-12-23'),(255204, '2022-01-11'),
    (255204, '2022-01-18'),(255204, '2022-03-05'),(255204, '2022-06-19'),
    (255204, '2022-07-06'),(255204, '2021-12-04'),(255204, '2021-12-18'),
    (255204, '2021-12-20'),(255204, '2021-12-29'),(255204, '2022-03-28'),
    (255204, '2022-03-31'),(255204, '2022-07-17'),(255204, '2022-07-21'),
    (255204, '2022-07-31'),(255204, '2021-12-12'),(255204, '2021-12-13'),
    (255204, '2021-12-21'),(255204, '2021-12-27'),(255204, '2022-01-07'),
    (255204, '2021-12-08'),(255204, '2022-01-27'),(255205, '2021-12-01'),
    (255205, '2021-12-02'),(255205, '2021-12-07'),(255205, '2022-01-22'),
    (255205, '2022-04-05');
    -- Full dataset has 74246 records
    
    CREATE TABLE `student_purchases` 
    (`purchase_id` INT NOT NULL, 
    `student_id` INT DEFAULT NULL, 
    `date_purchased` DATE DEFAULT NULL, 
    PRIMARY KEY (`purchase_id`));
    
    INSERT INTO `student_purchases` VALUES
    (15055, 255201, '2021-12-01'),(15057, 255203, '2021-12-01'),
    (15059, 255204, '2021-12-01'),(15060, 255196, '2021-12-01');
    -- Full dataset has 5922 records

The objective is to filter out the records where 'first_date_purchased' is less than 'first_date_watched' in a single select statement, in the most effective way possible. I'm writing this query as a student in Data Analysis and am still new to programming in general.

Thank you in advance for you assistance!


Solution

  • Consider:

    Using a HAVING clause which executes like a where clause AFTER the selected values and aggregation is complete.

    Note: Standard SQL is to include all non-aggregated selected objects in the GROUP BY clause. So, I added the date_registered to the group by. MySQL supports this syntax but it can be dangerous to use if you don't fully understand what it's doing. (Basically it's free to pick any value in the group even if those values are different) so if a student_Id could have different date_Registered for some reason, you could get different results in each run. STANDARD SQL is to always group by all values not aggregated in the select. Other RDBMS would not even allow your current query to run.

    Furthermore null evaluation in a in/not in is also suspect. Exists/not exists and correlation may be faster; but I don't see a reason to use it since I believe the Having clause will meet your need.

    Demo dbfiddle.uk

       SELECT se.student_id,
               si.date_registered,
               MIN(se.date_watched) AS first_date_watched,
               MIN(sp.date_purchased) AS first_date_purchased
        FROM student_engagement se
          LEFT JOIN student_info si ON se.student_id = si.student_id
          LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
        GROUP BY se.student_id, si.date_registered
        HAVING first_date_purchased >= first_date_watched
         or first_date_purchased is null
        -- or use this in the having coalesce(first_date_purchased,first_date_watched) >= first_date_watched
        ORDER BY first_date_purchased DESC
    

    However, it's odd that you expect 4 records. Since first_date_purchased is null on two of the 5 records it would fail equality checks. Logically How is a Null value on first_date_purchased for students 255200, 255205 <= first_date_watched? but you seem to want them so we can either coalesce or use an or in the having.

    Either of these do not subjet you to a subquery or additional query as desired; but I'm unsure if either is more performant w/o a full dataset / knowledge of your indexes etc.