I need to track the sequence of sites visited by a user in their session. My data looks like this:
visitor | session_num | site | page_view_num | timestamp |
---|---|---|---|---|
User A | 2 | Site A | 1 | 2024-03-22 11:41:10 |
User A | 2 | Site A | 2 | 2024-03-22 11:41:14 |
User A | 2 | Site A | 3 | 2024-03-22 11:41:16 |
User A | 2 | Site B | 4 | 2024-03-22 11:41:47 |
User A | 2 | Site B | 5 | 2024-03-22 11:42:18 |
User A | 2 | Site B | 6 | 2024-03-22 11:42:19 |
User A | 2 | Site C | 7 | 2024-03-22 11:42:55 |
User A | 2 | Site C | 8 | 2024-03-22 11:43:18 |
User A | 2 | Site C | 9 | 2024-03-22 11:43:31 |
User A | 2 | Site A | 10 | 2024-03-22 11:52:19 |
User A | 2 | Site A | 11 | 2024-03-22 10:52:21 |
User B | 4 | Site B | 1 | 2024-03-25 10:52:30 |
User B | 4 | Site B | 2 | 2024-03-25 10:52:40 |
User B | 4 | Site A | 3 | 2024-03-25 10:53:50 |
User B | 4 | Site B | 4 | 2024-03-25 10:54:19 |
I am trying to achieve something like this (**site_num **column)
visitor | session_num | site | page_view_num | site_num | timestamp |
---|---|---|---|---|---|
User A | 2 | Site A | 1 | 1 | 2024-03-22 11:41:10 |
User A | 2 | Site A | 2 | 1 | 2024-03-22 11:41:14 |
User A | 2 | Site A | 3 | 1 | 2024-03-22 11:41:16 |
User A | 2 | Site B | 4 | 2 | 2024-03-22 11:41:47 |
User A | 2 | Site B | 5 | 2 | 2024-03-22 11:42:18 |
User A | 2 | Site B | 6 | 2 | 2024-03-22 11:42:19 |
User A | 2 | Site C | 7 | 3 | 2024-03-22 11:42:55 |
User A | 2 | Site C | 8 | 3 | 2024-03-22 11:43:18 |
User A | 2 | Site C | 9 | 3 | 2024-03-22 11:43:31 |
User A | 2 | Site A | 10 | 4 | 2024-03-22 11:52:19 |
User A | 2 | Site A | 11 | 4 | 2024-03-22 10:52:21 |
User B | 4 | Site B | 1 | 1 | 2024-03-25 10:52:30 |
User B | 4 | Site B | 2 | 1 | 2024-03-25 10:52:40 |
User B | 4 | Site A | 3 | 2 | 2024-03-25 10:53:50 |
User B | 4 | Site B | 4 | 3 | 2024-03-25 10:54:19 |
Essentially visits to each site should be grouped sequentially (partitioned by visitor, session_num) - I've tried various window functions but I'm struggling with cases when a user goes back to a site they have already visited early in the session, which needs to be counted as a separate instance
e.g. for User A the sequence is: Site A (site_num=1) > Site B (site_num=2) > Site C (site_num=3) > Site A (site_num=4)
In PostgreSQL, see the solution by @The Impaler
.
In Vertica, you should use the CONDITIONAL_TRUE_EVENT()
OLAP function, which is Vertica's means to "sessionize" your time series data. (where "sessionizing" and "solving the gaps-and-islands question" mean the same).
Without that function, you need to nest two OLAP queries into each other. The inner has a counter that is at 0 when the visitor is the same as before, and at 1 when the visitor changes. The outer selects from the inner, creating the running sum of the counter obtained previously.
I get exactly your result if I change the timestamp 2024-03-22 10:52:21
to 2024-03-22 11:52:21
- which would be consistent with the rules you state in the description.
And I'm using the named window , which you could also use in PostgreSQL, and add the CONDITIONAL_CHANGE_EVENT()
function, which you could use alternatively.
WITH
-- your input
indata(visitor,session_num,site,page_view_num,timestamp) AS (
SELECT 'User A',2,'Site A', 1,TIMESTAMP '2024-03-22 11:41:10'
UNION ALL SELECT 'User A',2,'Site A', 2,TIMESTAMP '2024-03-22 11:41:14'
UNION ALL SELECT 'User A',2,'Site A', 3,TIMESTAMP '2024-03-22 11:41:16'
UNION ALL SELECT 'User A',2,'Site B', 4,TIMESTAMP '2024-03-22 11:41:47'
UNION ALL SELECT 'User A',2,'Site B', 5,TIMESTAMP '2024-03-22 11:42:18'
UNION ALL SELECT 'User A',2,'Site B', 6,TIMESTAMP '2024-03-22 11:42:19'
UNION ALL SELECT 'User A',2,'Site C', 7,TIMESTAMP '2024-03-22 11:42:55'
UNION ALL SELECT 'User A',2,'Site C', 8,TIMESTAMP '2024-03-22 11:43:18'
UNION ALL SELECT 'User A',2,'Site C', 9,TIMESTAMP '2024-03-22 11:43:31'
UNION ALL SELECT 'User A',2,'Site A',10,TIMESTAMP '2024-03-22 11:52:19'
UNION ALL SELECT 'User A',2,'Site A',11,TIMESTAMP '2024-03-22 11:52:21'
UNION ALL SELECT 'User B',4,'Site B', 1,TIMESTAMP '2024-03-25 10:52:30'
UNION ALL SELECT 'User B',4,'Site B', 2,TIMESTAMP '2024-03-25 10:52:40'
UNION ALL SELECT 'User B',4,'Site A', 3,TIMESTAMP '2024-03-25 10:53:50'
UNION ALL SELECT 'User B',4,'Site B', 4,TIMESTAMP '2024-03-25 10:54:19'
)
-- real query starts here, replace following comma with "WITH"
SELECT
visitor
, session_num
, site
, page_view_num
, CONDITIONAL_TRUE_EVENT(LAG(site,1,'site?')<> site) OVER w AS site_num
, CONDITIONAL_CHANGE_EVENT(site) OVER w + 1 AS site_num_chg
, timestamp
FROM indata
WINDOW w AS (PARTITION BY visitor ORDER BY timestamp)
ORDER BY visitor,timestamp;
visitor | session_num | site | page_view_num | site_num | site_num_chg | timestamp |
---|---|---|---|---|---|---|
User A | 2 | Site A | 1 | 1 | 1 | 2024-03-22 11:41:10 |
User A | 2 | Site A | 2 | 1 | 1 | 2024-03-22 11:41:14 |
User A | 2 | Site A | 3 | 1 | 1 | 2024-03-22 11:41:16 |
User A | 2 | Site B | 4 | 2 | 2 | 2024-03-22 11:41:47 |
User A | 2 | Site B | 5 | 2 | 2 | 2024-03-22 11:42:18 |
User A | 2 | Site B | 6 | 2 | 2 | 2024-03-22 11:42:19 |
User A | 2 | Site C | 7 | 3 | 3 | 2024-03-22 11:42:55 |
User A | 2 | Site C | 8 | 3 | 3 | 2024-03-22 11:43:18 |
User A | 2 | Site C | 9 | 3 | 3 | 2024-03-22 11:43:31 |
User A | 2 | Site A | 10 | 4 | 4 | 2024-03-22 11:52:19 |
User A | 2 | Site A | 11 | 4 | 4 | 2024-03-22 11:52:21 |
User B | 4 | Site B | 1 | 1 | 1 | 2024-03-25 10:52:30 |
User B | 4 | Site B | 2 | 1 | 1 | 2024-03-25 10:52:40 |
User B | 4 | Site A | 3 | 2 | 2 | 2024-03-25 10:53:50 |
User B | 4 | Site B | 4 | 3 | 3 | 2024-03-25 10:54:19 |