sqltimeamazon-athenalead

Find time difference between non-consecutive rows depending on column values in Athena


I have a dataset with conversation data between a User and a Customer Service. I need to find the response time (time difference) between after a user sent a message and first time the Customer Service responded to the message

enter image description here

What I struggle with is a user can send multiple consecutive message before the Customer Service can reply. If a user sent multiple consecutive message, I need to take the time of the last row, but if the customer service sent multiple consecutive message, I need to take the time of the first row

My expected result is as below:

enter image description here

So, from the 10 message_id in the base table, there will be 3 rows:

I've tried using lead function as follows but the result is not what I expected

 select sender,
        start_message,
        start_reply,
        date_diff('second',start_message,start_reply)/60.0 as response_time
from (
        select message_id,
                created_at as start_message,
                  lead(created_at) over(partition by sender  order by created_at) as start_reply,
                  sender_type
          from base_table
     ) t 
 order by 2 asc

I'm using athena, any help is greatly appreciated, thank you


Solution

  • One approach is to use a flag that is set to one whenever the sender is 'user' and the previous sender is 'cs', then use a running sum on that flag to group every two pairs of consecutive (user, cs) as a single group.

    with t as
    (
      select *,
        case when 
          sender ='user' and lag(sender, 1, sender) over (order by created_at) = 'cs' 
          then 1 else 0 
        end as flag
     from tbl_name
    ),
    grps as
    (
      select *, 
        sum(flag) over (order by created_at) as grp
      from t
    )
    select max(case when sender='user' then message_id end) as message_id_user,
           min(case when sender='cs' then message_id end) as message_id_cs, 
           date_diff('second', max(case when sender='user' then created_at end), min(case when sender='cs' then created_at end)) as response_time
    from grps
    group by grp
    order by grp