I need to adjust the following select statement to effectively filter out 523 records where 'first_date_purchased' is less than 'first_date_watched'. The following select statement returns 20778 records, of which 20255 should remain.
SELECT se.student_id,
si.date_registered,
MIN(se.date_watched) AS first_date_watched,
MIN(sp.date_purchased) AS first_date_purchased
FROM student_engagement se
LEFT JOIN student_info si ON se.student_id = si.student_id
LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
GROUP BY se.student_id
ORDER BY- first_date_purchased DESC
-- Gives 20778 records
I have come up with the following solution, using NOT IN, but it has a substantial runtime of 1 minute. Initially I cancelled the process, but was advised by @Akina to include a sample dataset, on which both this query and another solution by @Raphi worked perfectly. My query is as follows:
SELECT se.student_id,
si.date_registered,
MIN(se.date_watched) AS first_date_watched,
MIN(sp.date_purchased) AS first_date_purchased
FROM student_engagement se
LEFT JOIN student_info si ON se.student_id = si.student_id
LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
WHERE se.student_id NOT IN (SELECT sp2.student_id
FROM student_purchases sp2
WHERE sp2.date_purchased < se.date_watched)
GROUP BY se.student_id
ORDER BY- first_date_purchased DESC;
-- Gives 20255 records
Although, I wanted to complete all filtering within a single select statement, I initially cast the unfiltered result of 20778 records into a new table called 'unfiltered_true_dataset' and got the desired filtering done with the following select statement based on that:
SELECT *
FROM unfiltered_true_dataset
WHERE student_id NOT IN (SELECT student_id
FROM unfiltered_true_dataset
WHERE first_date_purchased < first_date_watched)
-- Successfully obtained 20255 records
My question is whether there might be a more effective way to apply such a filter in a single select statement? Here is a sample dataset. The unfiltered select stetement returns 5 records, where the solution using the NOT IN subquery correctly returns only 4 records:
CREATE TABLE `student_info`
(`student_id` INT NOT NULL,
`date_registered` DATE DEFAULT NULL,
PRIMARY KEY (`student_id`));
INSERT INTO `student_info` VALUES
(255193, '2021-12-01'),(255194, '2021-12-01'),(255196, '2021-12-01'),
(255198, '2021-12-01'),(255199, '2021-12-01'),(255200, '2021-12-01'),
(255201, '2021-12-01'),(255203, '2021-12-01'),(255204, '2021-12-01'),
(255205, '2021-12-01');
-- Full dataset has 40979 records
CREATE TABLE `student_engagement`
(`student_id` INT DEFAULT NULL,
`date_watched` DATE DEFAULT NULL);
INSERT INTO `student_engagement` VALUES
(255200, '2021-12-01'),(255201, '2021-12-02'),(255201, '2021-12-06'),
(255201, '2021-12-08'),(255201, '2022-10-05'),(255203, '2022-03-08'),
(255203, '2022-03-17'),(255203, '2022-03-23'),(255203, '2021-12-01'),
(255203, '2022-01-25'),(255203, '2022-01-26'),(255203, '2022-01-27'),
(255203, '2022-02-08'),(255203, '2022-02-09'),(255203, '2022-02-21'),
(255203, '2022-02-22'),(255203, '2022-02-23'),(255204, '2021-12-01'),
(255204, '2021-12-02'),(255204, '2021-12-03'),(255204, '2021-12-14'),
(255204, '2021-12-17'),(255204, '2021-12-07'),(255204, '2021-12-05'),
(255204, '2021-12-09'),(255204, '2021-12-23'),(255204, '2022-01-11'),
(255204, '2022-01-18'),(255204, '2022-03-05'),(255204, '2022-06-19'),
(255204, '2022-07-06'),(255204, '2021-12-04'),(255204, '2021-12-18'),
(255204, '2021-12-20'),(255204, '2021-12-29'),(255204, '2022-03-28'),
(255204, '2022-03-31'),(255204, '2022-07-17'),(255204, '2022-07-21'),
(255204, '2022-07-31'),(255204, '2021-12-12'),(255204, '2021-12-13'),
(255204, '2021-12-21'),(255204, '2021-12-27'),(255204, '2022-01-07'),
(255204, '2021-12-08'),(255204, '2022-01-27'),(255205, '2021-12-01'),
(255205, '2021-12-02'),(255205, '2021-12-07'),(255205, '2022-01-22'),
(255205, '2022-04-05');
-- Full dataset has 74246 records
CREATE TABLE `student_purchases`
(`purchase_id` INT NOT NULL,
`student_id` INT DEFAULT NULL,
`date_purchased` DATE DEFAULT NULL,
PRIMARY KEY (`purchase_id`));
INSERT INTO `student_purchases` VALUES
(15055, 255201, '2021-12-01'),(15057, 255203, '2021-12-01'),
(15059, 255204, '2021-12-01'),(15060, 255196, '2021-12-01');
-- Full dataset has 5922 records
The objective is to filter out the records where 'first_date_purchased' is less than 'first_date_watched' in a single select statement, in the most effective way possible. I'm writing this query as a student in Data Analysis and am still new to programming in general.
Thank you in advance for you assistance!
Consider:
Using a HAVING clause which executes like a where clause AFTER the selected values and aggregation is complete.
Note: Standard SQL is to include all non-aggregated selected objects in the GROUP BY clause. So, I added the date_registered to the group by. MySQL supports this syntax but it can be dangerous to use if you don't fully understand what it's doing. (Basically it's free to pick any value in the group even if those values are different) so if a student_Id could have different date_Registered for some reason, you could get different results in each run. STANDARD SQL is to always group by all values not aggregated in the select. Other RDBMS would not even allow your current query to run.
Furthermore null evaluation in a in/not in is also suspect. Exists/not exists and correlation may be faster; but I don't see a reason to use it since I believe the Having clause will meet your need.
SELECT se.student_id,
si.date_registered,
MIN(se.date_watched) AS first_date_watched,
MIN(sp.date_purchased) AS first_date_purchased
FROM student_engagement se
LEFT JOIN student_info si ON se.student_id = si.student_id
LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
GROUP BY se.student_id, si.date_registered
HAVING first_date_purchased >= first_date_watched
or first_date_purchased is null
-- or use this in the having coalesce(first_date_purchased,first_date_watched) >= first_date_watched
ORDER BY first_date_purchased DESC
However, it's odd that you expect 4 records. Since first_date_purchased is null on two of the 5 records it would fail equality checks. Logically How is a Null value on first_date_purchased for students 255200, 255205 <= first_date_watched? but you seem to want them so we can either coalesce or use an or in the having.
Either of these do not subjet you to a subquery or additional query as desired; but I'm unsure if either is more performant w/o a full dataset / knowledge of your indexes etc.