I am trying to build a user-flow or sankey diagram so I could find out what are the visited pages based on visitor_id or pathname. I would like to be able to see which pages contribute visits to others, and in which proportion. It would also be nice to have things like elapsed time, or number of pages in a single session for a user, or which are the entry and exit pages and so on, similar to web analytics tool.
My issue is that all I capture is a flat table with events, no concept or session as a user can navigate at any time. For the purpose of my analytics, I would define a session like a visit that was more than 1 hour apart from the last one for the same user.
My (simplified) table schema looks like this:
CREATE TABLE events (
visitor_id SYMBOL,
pathname SYMBOL,
timestamp TIMESTAMP,
metric_name SYMBOL
) TIMESTAMP(timestamp) PARTITION BY MONTH WAL;
select CAST(pathname as long) from events;
I have tried different queries, but I struggle to get all I want in a single one. As an example, this is a query I am trying to use for entry pages, but it lacks the rest of features:
WITH FirstVisit AS (
SELECT visitor_id, min(timestamp) AS first_visit, pathname AS landing_page
FROM events WHERE metric_name = 'page_view' AND timestamp > dateadd('d', -7, now())
GROUP BY visitor_id, landing_page
)
SELECT vit.pathname, count(*) AS pageviews FROM vitals as vit
INNER JOIN FirstVisit fv ON vit.visitor_id = fv.visitor_id
WHERE vit.metric_name = 'page_view' AND vit.pathname != fv.landing_page AND vit.timestamp > fv.first_visit AND timestamp > dateadd('d', -7, now())
GROUP BY vit.pathname
ORDER BY pageviews DESC
LIMIT 5;
To make things easier, I have generated a demo CSV file you can use to populate a table with simulated data for 4 visitors during 7 days, with 2 sessions per day.
By combining window functions and CASE
statements we can do this:
PrevEvents
subquery to identify gaps, and the VisitorSession
subquery to number sessions for each user.GlobalSessions
subquery, by concatenating user_id and number. It would be nice if we could use some hashing function to anonymise the user id, but I am not aware of a simple way of doing this in QuestDB.EventSequences
query. Note that I assign sequences order by timestamp both ASC and DESC, so we can find also exit pages.EventSequences
query, as at that point we already have unique global session IDs.first_value
argument. Instead I am doing a left outer join
filtering the row where the sequence is 1 higher than the current. I am doing this at the EventsFullInfo
subquery.With that, we have the base to do analytics, and we can already count page hits for the next page from current, identify elapsed time between hits or since the start of the session, count sessions per user, or queries to power navigation funnels and sankey diagrams.
WITH PrevEvents AS (
SELECT
visitor_id,
pathname,
timestamp,
first_value(timestamp::long) OVER (
PARTITION BY visitor_id ORDER BY timestamp
ROWS 1 PRECEDING EXCLUDE CURRENT ROW
) AS prev_ts
FROM
events where timestamp > dateadd('d', -7, now())
and metric_name = 'page_view'
), VisitorSessions AS (
select *,
SUM(CASE WHEN datediff('h', timestamp, prev_ts::timestamp)>1 THEN 1 END)
OVER(
PARTITION BY visitor_id
ORDER BY timestamp
) as local_session_id from PrevEvents
), GlobalSessions AS (
select visitor_id, pathname, timestamp, prev_ts,
concat(visitor_id, '#', coalesce(local_session_id,0)::int) AS session_id
FROM VisitorSessions
), EventSequences AS (
select *, row_number() OVER (
PARTITION BY session_id ORDER BY timestamp
) as session_sequence,
row_number() OVER (
PARTITION BY session_id ORDER BY timestamp DESC
) as reverse_session_sequence,
first_value(timestamp::long) OVER (
PARTITION BY session_id ORDER BY timestamp
) as session_ts
from GlobalSessions
), EventsFullInfo AS (
select e1.session_id, e1.session_ts::timestamp as session_ts, e1.visitor_id,
e1.timestamp, e1.pathname, e1.session_sequence,
CASE WHEN e1.session_sequence = 1 THEN true END is_entry_page,
e2.pathname, datediff('T', e1.timestamp, e1.prev_ts::timestamp)::double as elapsed,
e2.reverse_session_sequence,
CASE WHEN e2.reverse_session_sequence = 1 THEN true END is_exit_page,
FROM EventSequences e1
left join EventSequences e2 ON (session_id)
where e2.session_sequence - e1.session_sequence = 1
)
SELECT * from EventsFullInfo;
You can see at the image below a sample result using the provided demo CSV file.