sql snowflake-cloud-data-platform logic method-combination

How to get combination result by one column history sql snowflake

I have a table history task of user, seems like this. 1 user have 9 task.

user_id     task    IS_COMPLETED updated_at
123       Task 1    1            2024-01-01
123       Task 2    1            2024-01-01
123       Task 3    0            2024-01-01
123       Task 4    0            2024-01-01
123       Task 5    0            2024-01-01
123       Task 6    1            2024-01-01
123       Task 7    1            2024-01-01
123       Task 8    1            2024-01-01
123       Task 9    1            2024-01-01

I want to grouping with possible combination of the task, to become table like this. So I know what kind of task combinations are quickly completed by users.

combination     total_user_completed
task1_task2         20
task1_task3         15 
task1_task_4        15
task1_task_5        14
:                    : (and so on for 2 combination) 
task1_task2_task3   10
task4_task5_task_6  11
:                   : (and so on for 3 combination)

(for more detail about combination example: from A,B,C i got combination between A-B,B-C,A-C, and A-B-C. No need to be sequential)

I have tried with recursive sql but it didn't work well

"Recursive Join ran out of memory. Please re-run this query on a larger warehouse"

Then I also try to make it with case when, but it's not ideal i need to define one by one like this.

    with all_task as (
            select 
                distinct user_id,
                task_name as task, 
                is_completed,
                updated_at
            from task
            where 
                is_completed = 1
            group by all
          ),
        agg as (
            select 
                distinct user_id, 
                activated_at as period,
                max(case when task in ('Task 1') and IS_COMPLETED = 1 then date(updated_at) end) as task_1,
                max(case when task in ('Task 2') and IS_COMPLETED = 1 then date(updated_at) end) as task_2,
                max(case when task in ('Task 3') and IS_COMPLETED = 1 then date(updated_at) end) as task_3,
                max(case when task in ('Task 4') and IS_COMPLETED = 1 then date(updated_at) end) as task_4,
                max(case when task in ('Task 5') and IS_COMPLETED = 1 then date(updated_at) end) as task_5,
                max(case when task in ('Task 6') and IS_COMPLETED = 1 then date(updated_at) end) as task_6,
                max(case when task in ('Task 7') and IS_COMPLETED = 1 then date(updated_at) end) as task_7,
                max(case when task in ('Task 8') and IS_COMPLETED = 1 then date(updated_at) end) as task_8,
                max(case when task in ('Task 9') and IS_COMPLETED = 1 then date(updated_at) end) as task_9
            from all_task 
            group by all
            ),
            task_group as (
            select user_id, 
                case when task_1 is not null and task_2 is not null then user_id end as task_1_2
                case when task_2 not null and task_3 is not null then user_id end as task_2_3
                .........
            from agg
             )
            select 'Task1_task2' as combination, 
            count(distinct task_1_2) as total
            from task_group

            union all 

            select 'task2_task3' as combination, 
            count(distinct task_2_3) as total
            from task_group
.... (and so on)

Is there any suggestion to solve this? Thanks a lot!

*Note for now I need at least 2/3 combination of task. Thank you

Solution

so with a CTE for test data:

with data(user_id, task, is_completed, updated_at) as (
    select * from values
    (123, 'Task 1', 1, '2024-01-01'::date),
    (123, 'Task 2', 1, '2024-01-01'::date),
    (123, 'Task 3', 0, '2024-01-01'::date),
    (123, 'Task 4', 0, '2024-01-01'::date),
    (123, 'Task 5', 0, '2024-01-01'::date),
    (123, 'Task 6', 1, '2024-01-01'::date),
    (123, 'Task 7', 1, '2024-01-01'::date),
    (123, 'Task 8', 1, '2024-01-01'::date),
    (123, 'Task 9', 1, '2024-01-01'::date)
)

and a CTE for the combinations of 2:

), combinations_2 as (
    select
        a.user_id, 
        a.task || '_' || b.task as combi
    from data as a
    join data as b
        on a.user_id = b.user_id 
        --and a.updated_at = b.updated_at
        and a.is_completed = 1 and b.is_completed = 1
        and a.task < b.task
),

and a CTE for the 3's:

), combinations_3 as (
    select
        a.user_id, 
        a.task || '_' || b.task || '_' || c.task as combi
    from data as a
    join data as b
        on a.user_id = b.user_id 
        --and a.updated_at = b.updated_at
        and a.is_completed = 1 and b.is_completed = 1
        and a.task < b.task
    join data as c
        on a.user_id = c.user_id 
        --and a.updated_at = c.updated_at
        and a.is_completed = 1 and c.is_completed = 1
        and b.task < c.task
),

union those together:

), merged as (
    select * from combinations_2
    union all 
    select * from combinations_3
)

then count:

select 
    combi, 
    count(user_id) 
from merged
group by 1
order by 1;

Now you might want to use a CTE to filter the is_completed = 0 rows, but that might also cause sync between the uses, and cause performance loss. Another method would be to build the 2 combi's and and also keep the b.task and then do the third round join against that, that might perform better.

I also assumed the updated_at are meaningless, and that there are not duplicate dates for the same ID/Task, thus again it might be better to preprocess the data to remove those. Thus:

with data(user_id, task, is_completed, updated_at) as (
    select * from values
    (123, 'Task 1', 1, '2024-01-01'::date),
    (123, 'Task 2', 1, '2024-01-01'::date),
    (123, 'Task 3', 0, '2024-01-01'::date),
    (123, 'Task 4', 0, '2024-01-01'::date),
    (123, 'Task 5', 0, '2024-01-01'::date),
    (123, 'Task 6', 1, '2024-01-01'::date),
    (123, 'Task 7', 1, '2024-01-01'::date),
    (123, 'Task 8', 1, '2024-01-01'::date),
    (123, 'Task 9', 1, '2024-01-01'::date)
), cleaned as (
    select distinct user_id, task
    from data
    where is_completed = 1
), combinations_2 as (
    select
        a.user_id, 
        a.task || '_' || b.task as combi
    from cleaned as a
    join cleaned as b
        on a.user_id = b.user_id 
        and a.task < b.task
), combinations_3 as (
    select
        a.user_id, 
        a.task || '_' || b.task || '_' || c.task as combi
    from cleaned as a
    join cleaned as b
        on a.user_id = b.user_id 
        and a.task < b.task
    join cleaned as c
        on a.user_id = c.user_id 
        and b.task < c.task
), merged as (
    select * from combinations_2
    union all 
    select * from combinations_3
)
select 
    combi, 
    count(user_id) 
from merged
group by 1
order by 1;