sqlamazon-redshiftwindow-functions

Identify duplicates within a period of time using Redshift SQL


In a table, I have plan details of customers with their customer_id and enroll_date.

Now, I want to identify duplicate and valid enrollments from the overall data.

Duplicate: If a customer enrolls a plan, and then enrolls for another plan within 30 (<=30) days, then the second plan is considered as duplicate. Others are valid.

Example:

customer_id enroll_date type
2 2024-02-05 valid
2 2024-02-25 duplicate
1 2024-03-05 valid
1 2024-05-21 valid
1 2024-06-09 duplicate
1 2024-06-21 valid
1 2024-07-04 duplicate
1 2024-07-16 duplicate
1 2024-07-26 valid
1 2024-08-03 duplicate
1 2024-08-17 duplicate
1 2024-09-05 valid

Explanation:

Consider customer with customer_id = 1

The plan enrolled on 2024-03-05 is valid since its the first plan made by that customer.

Then, the next plan that is enrolled by the same customer is on 2024-05-21. Since its after 30 days since plan on 2024-03-05, it is valid.

Then, the next plan is on 2024-06-09 which is less than 30 days after the valid plan, which was on 2024-05-21. so it is a duplicate.

Then, the next plan is on 2024-06-21 which is more than 30 days since last valid plan, which was on 2024-05-21. so its valid.

Then, the next plan is on 2024-07-04 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.

Then, the next plan is on 2024-07-16 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.

Then the next plan is on 2024-07-26 which is more than 30 days since last valid plan, which was on 2024-06-21. so its valid. and so on...

I tried using lag window function, but it did not work as expected because it was considering the previous date only for comparison, whereas we need to compare each enroll_date with previous valid enroll_date.

WITH enrollments AS (
    SELECT 
        customer_id, 
        enroll_date,
        CASE 
            WHEN enroll_date - LAG(enroll_date) OVER (
                PARTITION BY customer_id ORDER BY enroll_date
            ) <= 30 THEN 
                LAG(enroll_date) OVER (
                    PARTITION BY customer_id ORDER BY enroll_date
                )
            ELSE enroll_date
        END 
        AS enroll_fill
    FROM plan
    where enroll_date is not null
)

SELECT 
    customer_id, 
    enroll_date, 
    enroll_fill,
    CASE 
        WHEN enroll_date = enroll_fill THEN 'valid' 
        ELSE 'duplicate' 
    END AS type
FROM enrollments

But this wasn't giving the expected output.

customer_id enroll_date expected enroll_fill expected type
1 2024-02-01 2024-02-01 valid
1 2024-02-15 2024-02-01 duplicate
1 2024-03-05 2024-03-05 valid
1 2024-05-21 2024-05-21 valid
1 2024-06-09 2024-05-21 duplicate
1 2024-06-21 2024-06-21 valid
1 2024-07-04 2024-06-21 duplicate
1 2024-07-16 2024-06-21 duplicate
1 2024-07-26 2024-07-26 valid
1 2024-08-03 2024-07-26 duplicate
1 2024-08-17 2024-07-26 duplicate
1 2024-09-05 2024-09-05 valid

The output given by above query:

customer_id enroll_date output enroll_fill output type
1 2024-02-01 2024-02-01 valid
1 2024-02-15 2024-02-01 duplicate
1 2024-03-05 2024-03-05 valid
1 2024-05-21 2024-05-21 valid
1 2024-06-09 2024-05-21 duplicate
1 2024-06-21 2024-06-09 duplicate
1 2024-07-04 2024-06-21 duplicate
1 2024-07-16 2024-07-04 duplicate
1 2024-07-26 2024-07-16 duplicate
1 2024-08-03 2024-07-26 duplicate
1 2024-08-17 2024-08-03 duplicate
1 2024-09-05 2024-08-17 duplicate

Kindly suggest if there are any other ways to do this.


Solution

  • Similar to the answers suggested, I created my own solution to this which I'm posting here for anyone looking for the answer.

    with recursive 
        enrollments as (
            select 
                plan.customer_id,  
                plan.enroll_date,
                row_number() over (
                    partition by plan.customer_id order by plan.enroll_date) 
                as rn
            from plan
        ),
        classified_enrollments(customer_id, enroll_date, rn, enroll_fill, enroll_type) 
        as (
            select 
                customer_id, 
                enroll_date, 
                rn, 
                enroll_date as enroll_fill,
                'valid'::text as enroll_type --first enroll is always valid
            from enrollments
            where rn = 1
            
            union all
            
            select 
                r.customer_id, 
                d.enroll_date, 
                d.rn, 
                case when d.enroll_date <= date(r.enroll_fill + interval '30d') 
                    then r.enroll_fill
                    else d.enroll_date
                end as enroll_fill,
                case when d.enroll_date is null then null
                    when d.enroll_date <= date(r.enroll_fill + interval '30d') 
                    then 'duplicate'::text
                    else 'valid'::text
                end as enroll_type
            from classified_enrollments r
            join enrollments d on r.customer_id = d.customer_id and d.rn = r.rn + 1
        )
        SELECT
            customer_id, 
            enroll_date, 
            enroll_fill, 
            enroll_type
        FROM classified_enrollments;