sqldatabasetime-seriesquestdb

Funnel or Sankey diagram from flat events table


I am trying to build a user-flow or sankey diagram so I could find out what are the visited pages based on visitor_id or pathname. I would like to be able to see which pages contribute visits to others, and in which proportion. It would also be nice to have things like elapsed time, or number of pages in a single session for a user, or which are the entry and exit pages and so on, similar to web analytics tool.

My issue is that all I capture is a flat table with events, no concept or session as a user can navigate at any time. For the purpose of my analytics, I would define a session like a visit that was more than 1 hour apart from the last one for the same user.

My (simplified) table schema looks like this:

CREATE TABLE events (
    visitor_id SYMBOL,
    pathname SYMBOL,
    timestamp TIMESTAMP,
    metric_name SYMBOL
) TIMESTAMP(timestamp) PARTITION BY MONTH WAL;
select CAST(pathname as long) from events;

I have tried different queries, but I struggle to get all I want in a single one. As an example, this is a query I am trying to use for entry pages, but it lacks the rest of features:

WITH FirstVisit AS (
  SELECT visitor_id, min(timestamp) AS first_visit, pathname AS landing_page
  FROM events WHERE metric_name = 'page_view' AND timestamp > dateadd('d', -7, now()) 
  GROUP BY visitor_id, landing_page
) 
SELECT vit.pathname, count(*) AS pageviews FROM vitals as vit
INNER JOIN FirstVisit fv ON vit.visitor_id = fv.visitor_id
WHERE vit.metric_name = 'page_view' AND vit.pathname != fv.landing_page AND vit.timestamp > fv.first_visit AND timestamp > dateadd('d', -7, now()) 
GROUP BY vit.pathname
ORDER BY pageviews DESC
LIMIT 5;

To make things easier, I have generated a demo CSV file you can use to populate a table with simulated data for 4 visitors during 7 days, with 2 sessions per day.


Solution

  • By combining window functions and CASE statements we can do this:

    1. Sessionize the data by identifying gaps longer than 1 hour. I do it in the PrevEvents subquery to identify gaps, and the VisitorSession subquery to number sessions for each user.
    2. Generate unique session ids, so we can do aggregations scope to individual sessions. I do it in the GlobalSessions subquery, by concatenating user_id and number. It would be nice if we could use some hashing function to anonymise the user id, but I am not aware of a simple way of doing this in QuestDB.
    3. Assign sequence numbers to each hit within a session, so we can reason about entry, exit, and bounce pages. I do this in the EventSequences query. Note that I assign sequences order by timestamp both ASC and DESC, so we can find also exit pages.
    4. Assign the session initial timestamp. I do it also at the EventSequences query, as at that point we already have unique global session IDs.
    5. Check next page in the sequence. I wanted to do this with a Window Function, but QuestDB as of today does not support anything but a number as the first_value argument. Instead I am doing a left outer join filtering the row where the sequence is 1 higher than the current. I am doing this at the EventsFullInfo subquery.

    With that, we have the base to do analytics, and we can already count page hits for the next page from current, identify elapsed time between hits or since the start of the session, count sessions per user, or queries to power navigation funnels and sankey diagrams.

    WITH PrevEvents AS (
      SELECT
        visitor_id,
        pathname,
        timestamp,
        first_value(timestamp::long) OVER (
        PARTITION BY visitor_id ORDER BY timestamp 
        ROWS 1 PRECEDING EXCLUDE CURRENT ROW
        ) AS prev_ts
      FROM
        events where timestamp > dateadd('d', -7, now()) 
        and metric_name = 'page_view' 
    ), VisitorSessions AS (
      select *, 
        SUM(CASE WHEN datediff('h', timestamp, prev_ts::timestamp)>1 THEN 1 END) 
        OVER(
          PARTITION BY visitor_id 
          ORDER BY timestamp
        ) as local_session_id from PrevEvents
    
    ), GlobalSessions AS (
      select visitor_id, pathname, timestamp, prev_ts, 
        concat(visitor_id, '#', coalesce(local_session_id,0)::int) AS session_id
      FROM VisitorSessions
    ), EventSequences AS (
      select *, row_number() OVER (
          PARTITION BY session_id ORDER BY timestamp
        ) as session_sequence, 
        row_number() OVER (
          PARTITION BY session_id ORDER BY timestamp DESC
        ) as reverse_session_sequence,
        first_value(timestamp::long) OVER (
          PARTITION BY session_id ORDER BY timestamp
        ) as session_ts 
      from GlobalSessions
    ), EventsFullInfo AS (
      select e1.session_id, e1.session_ts::timestamp as session_ts, e1.visitor_id,
        e1.timestamp, e1.pathname, e1.session_sequence, 
        CASE WHEN e1.session_sequence = 1 THEN true END is_entry_page, 
        e2.pathname, datediff('T', e1.timestamp, e1.prev_ts::timestamp)::double as elapsed,
        e2.reverse_session_sequence, 
        CASE WHEN e2.reverse_session_sequence = 1 THEN true END is_exit_page,
      FROM EventSequences e1
      left join EventSequences e2 ON (session_id) 
      where e2.session_sequence - e1.session_sequence = 1
    )
    SELECT * from EventsFullInfo;
    

    You can see at the image below a sample result using the provided demo CSV file.

    Screenshot from QuestDB web console displaying the session_id, session_ts, visitor_id, timestamp, pathname, session_sequence, is_entry_page, next_pathname, elapsed_since_prev, reverse_session_sequence, is_exit_page, and is_bounce columns