I am looking for help with creating and tracking user sessions and activities within sessions using R. At a high level I have a column of user Ids and a column of timestamps.
For each user ID I want to calculate the time difference between timestamps and use that to assign a session number based on a time limit value. (i.e. the first timestamp for the user gets a session number of 1. If the next time stamp is < 30 minutes later, the session number would remain as 1, otherwise it would be incremented to 2, and so on.)
Once a session number is assigned for each user action observation I then want to sequence the activities within each session. The first observation for each session would have an activity number of 1, the second observation within the session would have an activity number of 2 and so on. When the session number changes (or the user ID changes) the activity number would reset back to 1.
I have the following repro which does what I am looking for however, this code is very slow on my full dataset (1.4 million rows). In particular, the calls to difftime
and the for loops are very slow. I am wondering if there is a better way to do this and am open to multiple ideas. Perhaps there is a way to do this without for loops using purrr or perhaps this is an excellent use case for a parallel library. Happy to hear any suggestions for how to make this process easier.
library(tidyverse)
session_time_limit <- 30
x <- tibble(ID = c("a", "a", "a", "a", "b", "b", "c", "c", "c"),
Date = c(as.POSIXct("2021-01-25 19:17:12 UTC"), #a1
as.POSIXct("2021-01-25 19:17:30 UTC"), #a2
as.POSIXct("2021-01-25 19:57:12 UTC"), #a3
as.POSIXct("2021-01-25 19:59:12 UTC"), #a4
as.POSIXct("2021-01-25 20:11:12 UTC"), #b1
as.POSIXct("2021-01-25 20:42:12 UTC"), #b2
as.POSIXct("2021-01-25 21:15:42 UTC"), #c1
as.POSIXct("2021-01-25 21:17:12 UTC"), #c2
as.POSIXct("2021-01-25 21:20:13 UTC"))) #c3
x <- x %>%
arrange(ID, Date) %>%
group_by(ID) %>%
mutate(tdiff = difftime(Date, lag(Date), units = "mins"),
session_number = 1,
activity_within_session = 1)
for(i in seq(2, nrow(x))) {
if(!is.na(x$tdiff[i]) &
# x$ID[i] == x$ID[i-1] &
x$tdiff[i] > session_time_limit)
{
x$session_number[i] = x$session_number[i-1] + 1
} else if(!is.na(x$tdiff[i]) &
x$ID[i] == x$ID[i-1] &
x$tdiff[i] <= session_time_limit)
{
x$session_number[i] = x$session_number[i-1]
}
}
# Activity within Session
for(i in seq(2, nrow(x))) {
if(!is.na(x$tdiff[i]) &
# x$ID[i] == x$ID[i-1] &
x$session_number[i] == x$session_number[i-1])
{
x$activity_within_session[i] =
x$activity_within_session[i-1] + 1
}
}
An option could be:
library(tidyverse)
library(lubridate)
session_time_limit <- 30
df <- tibble(ID = c("a", "a", "a", "a", "b", "b", "c", "c", "c"),
Date = c(as.POSIXct("2021-01-25 19:17:12 UTC"), #a1
as.POSIXct("2021-01-25 19:17:30 UTC"), #a2
as.POSIXct("2021-01-25 19:57:12 UTC"), #a3
as.POSIXct("2021-01-25 19:59:12 UTC"), #a4
as.POSIXct("2021-01-25 20:11:12 UTC"), #b1
as.POSIXct("2021-01-25 20:42:12 UTC"), #b2
as.POSIXct("2021-01-25 21:15:42 UTC"), #c1
as.POSIXct("2021-01-25 21:17:12 UTC"), #c2
as.POSIXct("2021-01-25 21:20:13 UTC"))) #c3
df %>%
group_by(ID) %>%
mutate(tdiff = replace_na(interval(lag(Date), Date)/dminutes(1), 0),
session_number = cumsum(tdiff > session_time_limit) + 1) %>%
group_by(session_number, .add = T) %>%
mutate(activity_within_session = row_number()) %>%
ungroup()
#> # A tibble: 9 x 5
#> ID Date tdiff session_number activity_within_session
#> <chr> <dttm> <dbl> <dbl> <int>
#> 1 a 2021-01-25 19:17:12 0 1 1
#> 2 a 2021-01-25 19:17:30 0.3 1 2
#> 3 a 2021-01-25 19:57:12 39.7 2 1
#> 4 a 2021-01-25 19:59:12 2 2 2
#> 5 b 2021-01-25 20:11:12 0 1 1
#> 6 b 2021-01-25 20:42:12 31 2 1
#> 7 c 2021-01-25 21:15:42 0 1 1
#> 8 c 2021-01-25 21:17:12 1.5 1 2
#> 9 c 2021-01-25 21:20:13 3.02 1 3
Created on 2021-05-26 by the reprex package (v2.0.0)
data.table
library(data.table)
setDT(df)
df[, tdiff := c(0, diff(as.numeric(Date)) / 60), by = ID
][, session_number := cumsum(tdiff > session_time_limit) + 1, by = ID
][, activity_within_session := rowid(ID, session_number)][]
or
library(data.table)
library(magrittr)
df[, tdiff := c(0, diff(as.numeric(Date)) / 60), by = ID] %>%
.[, session_number := cumsum(tdiff > session_time_limit) + 1, by = ID] %>%
.[, activity_within_session := rowid(ID, session_number)] %>%
.[]
ID Date tdiff session_number activity_within_session
1: a 2021-01-25 19:17:12 0.000000 1 1
2: a 2021-01-25 19:17:30 0.300000 1 2
3: a 2021-01-25 19:57:12 39.700000 2 1
4: a 2021-01-25 19:59:12 2.000000 2 2
5: b 2021-01-25 20:11:12 0.000000 1 1
6: b 2021-01-25 20:42:12 31.000000 2 1
7: c 2021-01-25 21:15:42 0.000000 1 1
8: c 2021-01-25 21:17:12 1.500000 1 2
9: c 2021-01-25 21:20:13 3.016667 1 3