sqlmaxprestotrino

In Presto sql, how to grab the maximum timestamp in between two events within a table


I have a table df like below, my goal is to find the largest 'focus' timestamp after each visit for each [user:session] pair. (Timestamp as BIGINT data type and in ascending order)

event user_id session_id timestamp
visit a b1 t1
focus a b1 t2
focus a b1 t3
visit a b1 t4
focus a b1 t5
focus a b1 t6
focus a b1 t7
visit a b1 t8

I wish final output could be something like

user_id session_id visit_timestamp max_focus_timestamp
a b1 t1 t3
a b1 t4 t7

Curious how can I achieve that?

This is what I have tried, but last column (max_focus_timestamp) is always Null, curious how to fix that? Thanks!

with visit as 
(
select 
  user_id,
  session_id,
  timestamp as visit_timestamp,
  lead(timestamp) IGNORE NULLS over (PARTITION by user_id, session_id ORDER BY timestamp) as next_visit_timestamp 
  
from df 
where event = 'visit' 
),
focus as 
(
select 
  user_id,
  session_id,
  timestamp as focus_timestamp
  
from df 
where event != 'visit' 
)
select 
  distinct 
  v.user_id,
  v.session_id,
  v.visit_timestamp,
  max(f.focus_timestamp) as max_focus_timestamp
  
from visit v left join focus f on v.user_id = f.user_id and v.session_id = f.session_id 
 and f.focus_timestamp between v.visit_timestamp and v.next_visit_timestamp 
 
group by 1,2,3

Solution

  • You can use the gaps-and-islands approach (or one very similar to it). You create "groups" via window function based on count of visit encountered and then group it into final result:

    -- sample data
    with dataset(event, user_id, session_id, timestamp) as(
        values  ('visit', 'a', 'b1', 't1'),
        ('focus', 'a', 'b1', 't2'),
        ('focus', 'a', 'b1', 't3'),
        ('visit', 'a', 'b1', 't4'),
        ('focus', 'a', 'b1', 't5'),
        ('focus', 'a', 'b1', 't6'),
        ('focus', 'a', 'b1', 't7'),
        ('visit', 'a', 'b1', 't8')
    )
    
    -- query
    select user_id,
           session_id,
           min(timestamp) visit_timestamp, -- group starts with visit
           max(timestamp) max_focus_timestamp -- last entry in the group
    from (
        -- subquery to build "groups"
        select *,
            sum(if(event = 'visit', 1)) over (partition by user_id, session_id order by timestamp) gr
        from dataset
    )
    group by user_id, session_id, gr
    having count(*) > 1;
    

    Output:

    user_id session_id visit_timestamp max_focus_timestamp
    a b1 t1 t3
    a b1 t4 t7

    Another option is to use MATCH_RECOGNIZE (not available in Presto AFAIK, only in Trino):

    -- query
    select *
    from dataset
        MATCH_RECOGNIZE(
            PARTITION BY user_id, session_id
            ORDER BY timestamp
            MEASURES
                A.timestamp AS visit_timestamp,
                LAST(B.timestamp) AS max_focus_timestamp
            ONE ROW PER MATCH -- default can be skipped
            AFTER MATCH SKIP PAST LAST ROW -- arguably self-explanatory
            PATTERN (A B+) -- pattern, A is the "visit", followed by one or more records which are not visits
            DEFINE
                A AS event = 'visit',
                B AS event != 'visit'
            );