I have a dataset that looks like this:
Session | UserID | Duration(min) | id | Page# |
---|---|---|---|---|
123 | 345 | 40 | 1 | Pg1 |
123 | 345 | 40 | 2 | Pg2 |
123 | 345 | 40 | 3 | Pg2 |
123 | 345 | 40 | 4 | Pg3 |
123 | 345 | 40 | 5 | Pg2 |
123 | 345 | 40 | 6 | Pg3 |
124 | 346 | 100 | 1 | Pg2 |
124 | 346 | 100 | 2 | Pg2 |
124 | 346 | 100 | 3 | Pg3 |
124 | 346 | 100 | 4 | Pg4 |
125 | 347 | 59 | 1 | Pg1 |
125 | 347 | 59 | 2 | Pg5 |
I would like the dataset to look like this (below) because it would eventually create a more meaningful user journey.
Session | UserID | Duration(min) | id | Page# |
---|---|---|---|---|
123 | 345 | 40 | 1 | Pg1 |
123 | 345 | 40 | 2 | Pg2 |
123 | 345 | 40 | 3 | Pg3 |
123 | 345 | 40 | 4 | Pg2 |
123 | 345 | 40 | 5 | Pg3 |
124 | 346 | 100 | 1 | Pg2 |
124 | 346 | 100 | 2 | Pg3 |
124 | 346 | 100 | 3 | Pg4 |
125 | 347 | 59 | 1 | Pg1 |
125 | 347 | 59 | 2 | Pg5 |
Duration (min) = the number of minutes from the beginning of the path to the last page viewed. id = path number or path length number.
I am trying to take out subsequent repetitive events based on the session and user id, leaving the duration as is.
library('dplyr')
df %>%
# group rows by Session + UserID
group_by(Session,UserID) %>%
# filter rows where Page is different from the last Page (within group), or where last Page (within group) is NA
filter(Page != lag(Page, 1) | is.na(lag(Page, 1)) ) %>%
# create new ids
mutate(id = row_number()) %>%
# remove grouping, for future use
ungroup()
Result:
# A tibble: 10 × 5
Session UserID Duration.min. id Page
<int> <int> <int> <int> <chr>
1 123 345 40 1 Pg1
2 123 345 40 2 Pg2
3 123 345 40 3 Pg3
4 123 345 40 4 Pg2
5 123 345 40 5 Pg3
6 124 346 100 1 Pg2
7 124 346 100 2 Pg3
8 124 346 100 3 Pg4
9 125 347 59 1 Pg1
10 125 347 59 2 Pg5