I have a table history task of user, seems like this. 1 user have 9 task.
user_id task IS_COMPLETED updated_at
123 Task 1 1 2024-01-01
123 Task 2 1 2024-01-01
123 Task 3 0 2024-01-01
123 Task 4 0 2024-01-01
123 Task 5 0 2024-01-01
123 Task 6 1 2024-01-01
123 Task 7 1 2024-01-01
123 Task 8 1 2024-01-01
123 Task 9 1 2024-01-01
I want to grouping with possible combination of the task, to become table like this. So I know what kind of task combinations are quickly completed by users.
combination total_user_completed
task1_task2 20
task1_task3 15
task1_task_4 15
task1_task_5 14
: : (and so on for 2 combination)
task1_task2_task3 10
task4_task5_task_6 11
: : (and so on for 3 combination)
(for more detail about combination example: from A,B,C i got combination between A-B,B-C,A-C, and A-B-C. No need to be sequential)
I have tried with recursive sql but it didn't work well
"Recursive Join ran out of memory. Please re-run this query on a larger warehouse"
Then I also try to make it with case when, but it's not ideal i need to define one by one like this.
with all_task as (
select
distinct user_id,
task_name as task,
is_completed,
updated_at
from task
where
is_completed = 1
group by all
),
agg as (
select
distinct user_id,
activated_at as period,
max(case when task in ('Task 1') and IS_COMPLETED = 1 then date(updated_at) end) as task_1,
max(case when task in ('Task 2') and IS_COMPLETED = 1 then date(updated_at) end) as task_2,
max(case when task in ('Task 3') and IS_COMPLETED = 1 then date(updated_at) end) as task_3,
max(case when task in ('Task 4') and IS_COMPLETED = 1 then date(updated_at) end) as task_4,
max(case when task in ('Task 5') and IS_COMPLETED = 1 then date(updated_at) end) as task_5,
max(case when task in ('Task 6') and IS_COMPLETED = 1 then date(updated_at) end) as task_6,
max(case when task in ('Task 7') and IS_COMPLETED = 1 then date(updated_at) end) as task_7,
max(case when task in ('Task 8') and IS_COMPLETED = 1 then date(updated_at) end) as task_8,
max(case when task in ('Task 9') and IS_COMPLETED = 1 then date(updated_at) end) as task_9
from all_task
group by all
),
task_group as (
select user_id,
case when task_1 is not null and task_2 is not null then user_id end as task_1_2
case when task_2 not null and task_3 is not null then user_id end as task_2_3
.........
from agg
)
select 'Task1_task2' as combination,
count(distinct task_1_2) as total
from task_group
union all
select 'task2_task3' as combination,
count(distinct task_2_3) as total
from task_group
.... (and so on)
Is there any suggestion to solve this? Thanks a lot!
*Note for now I need at least 2/3 combination of task. Thank you
so with a CTE for test data:
with data(user_id, task, is_completed, updated_at) as (
select * from values
(123, 'Task 1', 1, '2024-01-01'::date),
(123, 'Task 2', 1, '2024-01-01'::date),
(123, 'Task 3', 0, '2024-01-01'::date),
(123, 'Task 4', 0, '2024-01-01'::date),
(123, 'Task 5', 0, '2024-01-01'::date),
(123, 'Task 6', 1, '2024-01-01'::date),
(123, 'Task 7', 1, '2024-01-01'::date),
(123, 'Task 8', 1, '2024-01-01'::date),
(123, 'Task 9', 1, '2024-01-01'::date)
)
and a CTE for the combinations of 2:
), combinations_2 as (
select
a.user_id,
a.task || '_' || b.task as combi
from data as a
join data as b
on a.user_id = b.user_id
--and a.updated_at = b.updated_at
and a.is_completed = 1 and b.is_completed = 1
and a.task < b.task
),
and a CTE for the 3's:
), combinations_3 as (
select
a.user_id,
a.task || '_' || b.task || '_' || c.task as combi
from data as a
join data as b
on a.user_id = b.user_id
--and a.updated_at = b.updated_at
and a.is_completed = 1 and b.is_completed = 1
and a.task < b.task
join data as c
on a.user_id = c.user_id
--and a.updated_at = c.updated_at
and a.is_completed = 1 and c.is_completed = 1
and b.task < c.task
),
union those together:
), merged as (
select * from combinations_2
union all
select * from combinations_3
)
then count:
select
combi,
count(user_id)
from merged
group by 1
order by 1;
Now you might want to use a CTE to filter the is_completed = 0
rows, but that might also cause sync between the uses, and cause performance loss. Another method would be to build the 2 combi's and and also keep the b.task
and then do the third round join against that, that might perform better.
I also assumed the updated_at are meaningless, and that there are not duplicate dates for the same ID/Task, thus again it might be better to preprocess the data to remove those. Thus:
with data(user_id, task, is_completed, updated_at) as (
select * from values
(123, 'Task 1', 1, '2024-01-01'::date),
(123, 'Task 2', 1, '2024-01-01'::date),
(123, 'Task 3', 0, '2024-01-01'::date),
(123, 'Task 4', 0, '2024-01-01'::date),
(123, 'Task 5', 0, '2024-01-01'::date),
(123, 'Task 6', 1, '2024-01-01'::date),
(123, 'Task 7', 1, '2024-01-01'::date),
(123, 'Task 8', 1, '2024-01-01'::date),
(123, 'Task 9', 1, '2024-01-01'::date)
), cleaned as (
select distinct user_id, task
from data
where is_completed = 1
), combinations_2 as (
select
a.user_id,
a.task || '_' || b.task as combi
from cleaned as a
join cleaned as b
on a.user_id = b.user_id
and a.task < b.task
), combinations_3 as (
select
a.user_id,
a.task || '_' || b.task || '_' || c.task as combi
from cleaned as a
join cleaned as b
on a.user_id = b.user_id
and a.task < b.task
join cleaned as c
on a.user_id = c.user_id
and b.task < c.task
), merged as (
select * from combinations_2
union all
select * from combinations_3
)
select
combi,
count(user_id)
from merged
group by 1
order by 1;