I have a dataset with conversation data between a User and a Customer Service. I need to find the response time (time difference) between after a user sent a message and first time the Customer Service responded to the message
What I struggle with is a user can send multiple consecutive message before the Customer Service can reply. If a user sent multiple consecutive message, I need to take the time of the last row, but if the customer service sent multiple consecutive message, I need to take the time of the first row
My expected result is as below:
So, from the 10 message_id in the base table, there will be 3 rows:
I've tried using lead function as follows but the result is not what I expected
select sender,
start_message,
start_reply,
date_diff('second',start_message,start_reply)/60.0 as response_time
from (
select message_id,
created_at as start_message,
lead(created_at) over(partition by sender order by created_at) as start_reply,
sender_type
from base_table
) t
order by 2 asc
I'm using athena, any help is greatly appreciated, thank you
One approach is to use a flag that is set to one whenever the sender is 'user' and the previous sender is 'cs', then use a running sum on that flag to group every two pairs of consecutive (user, cs) as a single group.
with t as
(
select *,
case when
sender ='user' and lag(sender, 1, sender) over (order by created_at) = 'cs'
then 1 else 0
end as flag
from tbl_name
),
grps as
(
select *,
sum(flag) over (order by created_at) as grp
from t
)
select max(case when sender='user' then message_id end) as message_id_user,
min(case when sender='cs' then message_id end) as message_id_cs,
date_diff('second', max(case when sender='user' then created_at end), min(case when sender='cs' then created_at end)) as response_time
from grps
group by grp
order by grp