I am working on a SQL query (Azure Databricks environment), where considering the following dataset:
clientid | visited | channel | purchase | visit_order |
---|---|---|---|---|
123 | abc133 | 0 | 1 | |
123 | efg446 | 0 | 2 | |
123 | gij729 | 1 | 3 | |
456 | klm183 | 0 | 1 | |
456 | nop496 | 0 | 2 | |
456 | qrs729 | 1 | 3 | |
456 | tuv894 | 0 | 1 | |
456 | wyz634 | 0 | 2 |
I want to get the following output:
clientid | user_journey | conversion |
---|---|---|
123 | google, facebook, instagram | 1 |
456 | google, linkedin, interest | 1 |
456 | google, instagram | 0 |
where the user_jorney column is composed of the channels that participated in a conversion journey. Note that the journey of users who, until then, have not made a purchase is also built.
Looking for commands that can help with this task, I found concat_ws, where I wrote the code below:
select
clientid,
concat_ws(',', collect_list(channel)) as user_journey,
sum(purchase) as conversion
from table_name group by clientid;
I get this result:
clientid | user_journey | conversion |
---|---|---|
123 | google, facebook, instagram | 1 |
456 | google, linkedin, pinterest, google, instagram | 1 |
Now I'm trying to consider a condition to get the desired result but so far I haven't been able to find.
Could you help me how can i solve this task?
I tried to repro your scenario,
where I gave sub query instead of original table in that query, I am selecting original table along with one extra column as row number where I am giving row number two every row partitioned by visit_order
column and orderd
by visited column.
My Query:
select
clientid,
concat_ws(',', collect_list(channel)) as user_journey,
sum(purchase) as conversion
from (SELECT *,ROW_NUMBER() OVER (PARTITION BY visit_order ORDER BY visited) AS RowNumber FROM docs) as docstb group by clientid , RowNumber order by clientid asc
Execution and Output: