In a table, I have plan details of customers with their customer_id and enroll_date.
Now, I want to identify duplicate and valid enrollments from the overall data.
Duplicate: If a customer enrolls a plan, and then enrolls for another plan within 30 (<=30) days, then the second plan is considered as duplicate. Others are valid.
Example:
customer_id | enroll_date | type |
---|---|---|
2 | 2024-02-05 | valid |
2 | 2024-02-25 | duplicate |
1 | 2024-03-05 | valid |
1 | 2024-05-21 | valid |
1 | 2024-06-09 | duplicate |
1 | 2024-06-21 | valid |
1 | 2024-07-04 | duplicate |
1 | 2024-07-16 | duplicate |
1 | 2024-07-26 | valid |
1 | 2024-08-03 | duplicate |
1 | 2024-08-17 | duplicate |
1 | 2024-09-05 | valid |
Explanation:
Consider customer with customer_id = 1
The plan enrolled on 2024-03-05 is valid since its the first plan made by that customer.
Then, the next plan that is enrolled by the same customer is on 2024-05-21. Since its after 30 days since plan on 2024-03-05, it is valid.
Then, the next plan is on 2024-06-09 which is less than 30 days after the valid plan, which was on 2024-05-21. so it is a duplicate.
Then, the next plan is on 2024-06-21 which is more than 30 days since last valid plan, which was on 2024-05-21. so its valid.
Then, the next plan is on 2024-07-04 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.
Then, the next plan is on 2024-07-16 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.
Then the next plan is on 2024-07-26 which is more than 30 days since last valid plan, which was on 2024-06-21. so its valid. and so on...
I tried using lag window function, but it did not work as expected because it was considering the previous date only for comparison, whereas we need to compare each enroll_date with previous valid enroll_date.
WITH enrollments AS (
SELECT
customer_id,
enroll_date,
CASE
WHEN enroll_date - LAG(enroll_date) OVER (
PARTITION BY customer_id ORDER BY enroll_date
) <= 30 THEN
LAG(enroll_date) OVER (
PARTITION BY customer_id ORDER BY enroll_date
)
ELSE enroll_date
END
AS enroll_fill
FROM plan
where enroll_date is not null
)
SELECT
customer_id,
enroll_date,
enroll_fill,
CASE
WHEN enroll_date = enroll_fill THEN 'valid'
ELSE 'duplicate'
END AS type
FROM enrollments
But this wasn't giving the expected output.
customer_id | enroll_date | expected enroll_fill | expected type |
---|---|---|---|
1 | 2024-02-01 | 2024-02-01 | valid |
1 | 2024-02-15 | 2024-02-01 | duplicate |
1 | 2024-03-05 | 2024-03-05 | valid |
1 | 2024-05-21 | 2024-05-21 | valid |
1 | 2024-06-09 | 2024-05-21 | duplicate |
1 | 2024-06-21 | 2024-06-21 | valid |
1 | 2024-07-04 | 2024-06-21 | duplicate |
1 | 2024-07-16 | 2024-06-21 | duplicate |
1 | 2024-07-26 | 2024-07-26 | valid |
1 | 2024-08-03 | 2024-07-26 | duplicate |
1 | 2024-08-17 | 2024-07-26 | duplicate |
1 | 2024-09-05 | 2024-09-05 | valid |
The output given by above query:
customer_id | enroll_date | output enroll_fill | output type |
---|---|---|---|
1 | 2024-02-01 | 2024-02-01 | valid |
1 | 2024-02-15 | 2024-02-01 | duplicate |
1 | 2024-03-05 | 2024-03-05 | valid |
1 | 2024-05-21 | 2024-05-21 | valid |
1 | 2024-06-09 | 2024-05-21 | duplicate |
1 | 2024-06-21 | 2024-06-09 | duplicate |
1 | 2024-07-04 | 2024-06-21 | duplicate |
1 | 2024-07-16 | 2024-07-04 | duplicate |
1 | 2024-07-26 | 2024-07-16 | duplicate |
1 | 2024-08-03 | 2024-07-26 | duplicate |
1 | 2024-08-17 | 2024-08-03 | duplicate |
1 | 2024-09-05 | 2024-08-17 | duplicate |
Kindly suggest if there are any other ways to do this.
Similar to the answers suggested, I created my own solution to this which I'm posting here for anyone looking for the answer.
with recursive
enrollments as (
select
plan.customer_id,
plan.enroll_date,
row_number() over (
partition by plan.customer_id order by plan.enroll_date)
as rn
from plan
),
classified_enrollments(customer_id, enroll_date, rn, enroll_fill, enroll_type)
as (
select
customer_id,
enroll_date,
rn,
enroll_date as enroll_fill,
'valid'::text as enroll_type --first enroll is always valid
from enrollments
where rn = 1
union all
select
r.customer_id,
d.enroll_date,
d.rn,
case when d.enroll_date <= date(r.enroll_fill + interval '30d')
then r.enroll_fill
else d.enroll_date
end as enroll_fill,
case when d.enroll_date is null then null
when d.enroll_date <= date(r.enroll_fill + interval '30d')
then 'duplicate'::text
else 'valid'::text
end as enroll_type
from classified_enrollments r
join enrollments d on r.customer_id = d.customer_id and d.rn = r.rn + 1
)
SELECT
customer_id,
enroll_date,
enroll_fill,
enroll_type
FROM classified_enrollments;