I have a table df
like below, my goal is to find the largest 'focus' timestamp after each visit for each [user:session] pair. (Timestamp as BIGINT data type and in ascending order)
event | user_id | session_id | timestamp |
---|---|---|---|
visit | a | b1 | t1 |
focus | a | b1 | t2 |
focus | a | b1 | t3 |
visit | a | b1 | t4 |
focus | a | b1 | t5 |
focus | a | b1 | t6 |
focus | a | b1 | t7 |
visit | a | b1 | t8 |
I wish final output could be something like
user_id | session_id | visit_timestamp | max_focus_timestamp |
---|---|---|---|
a | b1 | t1 | t3 |
a | b1 | t4 | t7 |
Curious how can I achieve that?
This is what I have tried, but last column (max_focus_timestamp) is always Null, curious how to fix that? Thanks!
with visit as
(
select
user_id,
session_id,
timestamp as visit_timestamp,
lead(timestamp) IGNORE NULLS over (PARTITION by user_id, session_id ORDER BY timestamp) as next_visit_timestamp
from df
where event = 'visit'
),
focus as
(
select
user_id,
session_id,
timestamp as focus_timestamp
from df
where event != 'visit'
)
select
distinct
v.user_id,
v.session_id,
v.visit_timestamp,
max(f.focus_timestamp) as max_focus_timestamp
from visit v left join focus f on v.user_id = f.user_id and v.session_id = f.session_id
and f.focus_timestamp between v.visit_timestamp and v.next_visit_timestamp
group by 1,2,3
You can use the gaps-and-islands approach (or one very similar to it). You create "groups" via window function based on count of visit
encountered and then group it into final result:
-- sample data
with dataset(event, user_id, session_id, timestamp) as(
values ('visit', 'a', 'b1', 't1'),
('focus', 'a', 'b1', 't2'),
('focus', 'a', 'b1', 't3'),
('visit', 'a', 'b1', 't4'),
('focus', 'a', 'b1', 't5'),
('focus', 'a', 'b1', 't6'),
('focus', 'a', 'b1', 't7'),
('visit', 'a', 'b1', 't8')
)
-- query
select user_id,
session_id,
min(timestamp) visit_timestamp, -- group starts with visit
max(timestamp) max_focus_timestamp -- last entry in the group
from (
-- subquery to build "groups"
select *,
sum(if(event = 'visit', 1)) over (partition by user_id, session_id order by timestamp) gr
from dataset
)
group by user_id, session_id, gr
having count(*) > 1;
Output:
user_id | session_id | visit_timestamp | max_focus_timestamp |
---|---|---|---|
a | b1 | t1 | t3 |
a | b1 | t4 | t7 |
Another option is to use MATCH_RECOGNIZE
(not available in Presto AFAIK, only in Trino):
-- query
select *
from dataset
MATCH_RECOGNIZE(
PARTITION BY user_id, session_id
ORDER BY timestamp
MEASURES
A.timestamp AS visit_timestamp,
LAST(B.timestamp) AS max_focus_timestamp
ONE ROW PER MATCH -- default can be skipped
AFTER MATCH SKIP PAST LAST ROW -- arguably self-explanatory
PATTERN (A B+) -- pattern, A is the "visit", followed by one or more records which are not visits
DEFINE
A AS event = 'visit',
B AS event != 'visit'
);