sqlsnowflake-cloud-data-platformlogicmethod-combination

How to get combination result by one column history sql snowflake


I have a table history task of user, seems like this. 1 user have 9 task.

user_id     task    IS_COMPLETED updated_at
123       Task 1    1            2024-01-01
123       Task 2    1            2024-01-01
123       Task 3    0            2024-01-01
123       Task 4    0            2024-01-01
123       Task 5    0            2024-01-01
123       Task 6    1            2024-01-01
123       Task 7    1            2024-01-01
123       Task 8    1            2024-01-01
123       Task 9    1            2024-01-01

I want to grouping with possible combination of the task, to become table like this. So I know what kind of task combinations are quickly completed by users.

combination     total_user_completed
task1_task2         20
task1_task3         15 
task1_task_4        15
task1_task_5        14
:                    : (and so on for 2 combination) 
task1_task2_task3   10
task4_task5_task_6  11
:                   : (and so on for 3 combination) 

(for more detail about combination example: from A,B,C i got combination between A-B,B-C,A-C, and A-B-C. No need to be sequential)

I have tried with recursive sql but it didn't work well

"Recursive Join ran out of memory. Please re-run this query on a larger warehouse"

Then I also try to make it with case when, but it's not ideal i need to define one by one like this.

    with all_task as (
            select 
                distinct user_id,
                task_name as task, 
                is_completed,
                updated_at
            from task
            where 
                is_completed = 1
            group by all
          ),
        agg as (
            select 
                distinct user_id, 
                activated_at as period,
                max(case when task in ('Task 1') and IS_COMPLETED = 1 then date(updated_at) end) as task_1,
                max(case when task in ('Task 2') and IS_COMPLETED = 1 then date(updated_at) end) as task_2,
                max(case when task in ('Task 3') and IS_COMPLETED = 1 then date(updated_at) end) as task_3,
                max(case when task in ('Task 4') and IS_COMPLETED = 1 then date(updated_at) end) as task_4,
                max(case when task in ('Task 5') and IS_COMPLETED = 1 then date(updated_at) end) as task_5,
                max(case when task in ('Task 6') and IS_COMPLETED = 1 then date(updated_at) end) as task_6,
                max(case when task in ('Task 7') and IS_COMPLETED = 1 then date(updated_at) end) as task_7,
                max(case when task in ('Task 8') and IS_COMPLETED = 1 then date(updated_at) end) as task_8,
                max(case when task in ('Task 9') and IS_COMPLETED = 1 then date(updated_at) end) as task_9
            from all_task 
            group by all
            ),
            task_group as (
            select user_id, 
                case when task_1 is not null and task_2 is not null then user_id end as task_1_2
                case when task_2 not null and task_3 is not null then user_id end as task_2_3
                .........
            from agg
             )
            select 'Task1_task2' as combination, 
            count(distinct task_1_2) as total
            from task_group

            union all 

            select 'task2_task3' as combination, 
            count(distinct task_2_3) as total
            from task_group
.... (and so on)

Is there any suggestion to solve this? Thanks a lot!

*Note for now I need at least 2/3 combination of task. Thank you


Solution

  • so with a CTE for test data:

    with data(user_id, task, is_completed, updated_at) as (
        select * from values
        (123, 'Task 1', 1, '2024-01-01'::date),
        (123, 'Task 2', 1, '2024-01-01'::date),
        (123, 'Task 3', 0, '2024-01-01'::date),
        (123, 'Task 4', 0, '2024-01-01'::date),
        (123, 'Task 5', 0, '2024-01-01'::date),
        (123, 'Task 6', 1, '2024-01-01'::date),
        (123, 'Task 7', 1, '2024-01-01'::date),
        (123, 'Task 8', 1, '2024-01-01'::date),
        (123, 'Task 9', 1, '2024-01-01'::date)
    )
    

    and a CTE for the combinations of 2:

    ), combinations_2 as (
        select
            a.user_id, 
            a.task || '_' || b.task as combi
        from data as a
        join data as b
            on a.user_id = b.user_id 
            --and a.updated_at = b.updated_at
            and a.is_completed = 1 and b.is_completed = 1
            and a.task < b.task
    ),
    

    enter image description here

    and a CTE for the 3's:

    ), combinations_3 as (
        select
            a.user_id, 
            a.task || '_' || b.task || '_' || c.task as combi
        from data as a
        join data as b
            on a.user_id = b.user_id 
            --and a.updated_at = b.updated_at
            and a.is_completed = 1 and b.is_completed = 1
            and a.task < b.task
        join data as c
            on a.user_id = c.user_id 
            --and a.updated_at = c.updated_at
            and a.is_completed = 1 and c.is_completed = 1
            and b.task < c.task
    ),
    

    enter image description here

    union those together:

    ), merged as (
        select * from combinations_2
        union all 
        select * from combinations_3
    )
    

    then count:

    select 
        combi, 
        count(user_id) 
    from merged
    group by 1
    order by 1;
    

    enter image description here

    Now you might want to use a CTE to filter the is_completed = 0 rows, but that might also cause sync between the uses, and cause performance loss. Another method would be to build the 2 combi's and and also keep the b.task and then do the third round join against that, that might perform better.

    I also assumed the updated_at are meaningless, and that there are not duplicate dates for the same ID/Task, thus again it might be better to preprocess the data to remove those. Thus:

    with data(user_id, task, is_completed, updated_at) as (
        select * from values
        (123, 'Task 1', 1, '2024-01-01'::date),
        (123, 'Task 2', 1, '2024-01-01'::date),
        (123, 'Task 3', 0, '2024-01-01'::date),
        (123, 'Task 4', 0, '2024-01-01'::date),
        (123, 'Task 5', 0, '2024-01-01'::date),
        (123, 'Task 6', 1, '2024-01-01'::date),
        (123, 'Task 7', 1, '2024-01-01'::date),
        (123, 'Task 8', 1, '2024-01-01'::date),
        (123, 'Task 9', 1, '2024-01-01'::date)
    ), cleaned as (
        select distinct user_id, task
        from data
        where is_completed = 1
    ), combinations_2 as (
        select
            a.user_id, 
            a.task || '_' || b.task as combi
        from cleaned as a
        join cleaned as b
            on a.user_id = b.user_id 
            and a.task < b.task
    ), combinations_3 as (
        select
            a.user_id, 
            a.task || '_' || b.task || '_' || c.task as combi
        from cleaned as a
        join cleaned as b
            on a.user_id = b.user_id 
            and a.task < b.task
        join cleaned as c
            on a.user_id = c.user_id 
            and b.task < c.task
    ), merged as (
        select * from combinations_2
        union all 
        select * from combinations_3
    )
    select 
        combi, 
        count(user_id) 
    from merged
    group by 1
    order by 1;