sqlpostgresqlvertica

SQL query to track a sequence of sites visited by a user in a session


I need to track the sequence of sites visited by a user in their session. My data looks like this:

visitor session_num site page_view_num timestamp
User A 2 Site A 1 2024-03-22 11:41:10
User A 2 Site A 2 2024-03-22 11:41:14
User A 2 Site A 3 2024-03-22 11:41:16
User A 2 Site B 4 2024-03-22 11:41:47
User A 2 Site B 5 2024-03-22 11:42:18
User A 2 Site B 6 2024-03-22 11:42:19
User A 2 Site C 7 2024-03-22 11:42:55
User A 2 Site C 8 2024-03-22 11:43:18
User A 2 Site C 9 2024-03-22 11:43:31
User A 2 Site A 10 2024-03-22 11:52:19
User A 2 Site A 11 2024-03-22 10:52:21
User B 4 Site B 1 2024-03-25 10:52:30
User B 4 Site B 2 2024-03-25 10:52:40
User B 4 Site A 3 2024-03-25 10:53:50
User B 4 Site B 4 2024-03-25 10:54:19

I am trying to achieve something like this (**site_num **column)

visitor session_num site page_view_num site_num timestamp
User A 2 Site A 1 1 2024-03-22 11:41:10
User A 2 Site A 2 1 2024-03-22 11:41:14
User A 2 Site A 3 1 2024-03-22 11:41:16
User A 2 Site B 4 2 2024-03-22 11:41:47
User A 2 Site B 5 2 2024-03-22 11:42:18
User A 2 Site B 6 2 2024-03-22 11:42:19
User A 2 Site C 7 3 2024-03-22 11:42:55
User A 2 Site C 8 3 2024-03-22 11:43:18
User A 2 Site C 9 3 2024-03-22 11:43:31
User A 2 Site A 10 4 2024-03-22 11:52:19
User A 2 Site A 11 4 2024-03-22 10:52:21
User B 4 Site B 1 1 2024-03-25 10:52:30
User B 4 Site B 2 1 2024-03-25 10:52:40
User B 4 Site A 3 2 2024-03-25 10:53:50
User B 4 Site B 4 3 2024-03-25 10:54:19

Essentially visits to each site should be grouped sequentially (partitioned by visitor, session_num) - I've tried various window functions but I'm struggling with cases when a user goes back to a site they have already visited early in the session, which needs to be counted as a separate instance

e.g. for User A the sequence is: Site A (site_num=1) > Site B (site_num=2) > Site C (site_num=3) > Site A (site_num=4)


Solution

  • In PostgreSQL, see the solution by @The Impaler.

    In Vertica, you should use the CONDITIONAL_TRUE_EVENT() OLAP function, which is Vertica's means to "sessionize" your time series data. (where "sessionizing" and "solving the gaps-and-islands question" mean the same).

    Without that function, you need to nest two OLAP queries into each other. The inner has a counter that is at 0 when the visitor is the same as before, and at 1 when the visitor changes. The outer selects from the inner, creating the running sum of the counter obtained previously.

    I get exactly your result if I change the timestamp 2024-03-22 10:52:21 to 2024-03-22 11:52:21 - which would be consistent with the rules you state in the description.

    And I'm using the named window , which you could also use in PostgreSQL, and add the CONDITIONAL_CHANGE_EVENT() function, which you could use alternatively.

    WITH
    -- your input
    indata(visitor,session_num,site,page_view_num,timestamp) AS (
              SELECT 'User A',2,'Site A', 1,TIMESTAMP '2024-03-22 11:41:10'
    UNION ALL SELECT 'User A',2,'Site A', 2,TIMESTAMP '2024-03-22 11:41:14'
    UNION ALL SELECT 'User A',2,'Site A', 3,TIMESTAMP '2024-03-22 11:41:16'
    UNION ALL SELECT 'User A',2,'Site B', 4,TIMESTAMP '2024-03-22 11:41:47'
    UNION ALL SELECT 'User A',2,'Site B', 5,TIMESTAMP '2024-03-22 11:42:18'
    UNION ALL SELECT 'User A',2,'Site B', 6,TIMESTAMP '2024-03-22 11:42:19'
    UNION ALL SELECT 'User A',2,'Site C', 7,TIMESTAMP '2024-03-22 11:42:55'
    UNION ALL SELECT 'User A',2,'Site C', 8,TIMESTAMP '2024-03-22 11:43:18'
    UNION ALL SELECT 'User A',2,'Site C', 9,TIMESTAMP '2024-03-22 11:43:31'
    UNION ALL SELECT 'User A',2,'Site A',10,TIMESTAMP '2024-03-22 11:52:19'
    UNION ALL SELECT 'User A',2,'Site A',11,TIMESTAMP '2024-03-22 11:52:21'
    UNION ALL SELECT 'User B',4,'Site B', 1,TIMESTAMP '2024-03-25 10:52:30'
    UNION ALL SELECT 'User B',4,'Site B', 2,TIMESTAMP '2024-03-25 10:52:40'
    UNION ALL SELECT 'User B',4,'Site A', 3,TIMESTAMP '2024-03-25 10:53:50'
    UNION ALL SELECT 'User B',4,'Site B', 4,TIMESTAMP '2024-03-25 10:54:19'
    )
    -- real query starts here, replace following comma with "WITH"
    SELECT
      visitor
    , session_num
    , site
    , page_view_num
    , CONDITIONAL_TRUE_EVENT(LAG(site,1,'site?')<> site) OVER w AS site_num
    , CONDITIONAL_CHANGE_EVENT(site) OVER w + 1 AS site_num_chg
    , timestamp
    FROM indata
    WINDOW w AS (PARTITION BY visitor ORDER BY timestamp) 
    ORDER BY visitor,timestamp;
    
    visitor session_num site page_view_num site_num site_num_chg timestamp
    User A 2 Site A 1 1 1 2024-03-22 11:41:10
    User A 2 Site A 2 1 1 2024-03-22 11:41:14
    User A 2 Site A 3 1 1 2024-03-22 11:41:16
    User A 2 Site B 4 2 2 2024-03-22 11:41:47
    User A 2 Site B 5 2 2 2024-03-22 11:42:18
    User A 2 Site B 6 2 2 2024-03-22 11:42:19
    User A 2 Site C 7 3 3 2024-03-22 11:42:55
    User A 2 Site C 8 3 3 2024-03-22 11:43:18
    User A 2 Site C 9 3 3 2024-03-22 11:43:31
    User A 2 Site A 10 4 4 2024-03-22 11:52:19
    User A 2 Site A 11 4 4 2024-03-22 11:52:21
    User B 4 Site B 1 1 1 2024-03-25 10:52:30
    User B 4 Site B 2 1 1 2024-03-25 10:52:40
    User B 4 Site A 3 2 2 2024-03-25 10:53:50
    User B 4 Site B 4 3 3 2024-03-25 10:54:19