rclickstream

Dealing with repetitive events in clickstream data in R


I have a dataset that looks like this:

Session UserID Duration(min) id Page#
123 345 40 1 Pg1
123 345 40 2 Pg2
123 345 40 3 Pg2
123 345 40 4 Pg3
123 345 40 5 Pg2
123 345 40 6 Pg3
124 346 100 1 Pg2
124 346 100 2 Pg2
124 346 100 3 Pg3
124 346 100 4 Pg4
125 347 59 1 Pg1
125 347 59 2 Pg5

I would like the dataset to look like this (below) because it would eventually create a more meaningful user journey.

Session UserID Duration(min) id Page#
123 345 40 1 Pg1
123 345 40 2 Pg2
123 345 40 3 Pg3
123 345 40 4 Pg2
123 345 40 5 Pg3
124 346 100 1 Pg2
124 346 100 2 Pg3
124 346 100 3 Pg4
125 347 59 1 Pg1
125 347 59 2 Pg5

Duration (min) = the number of minutes from the beginning of the path to the last page viewed. id = path number or path length number.

I am trying to take out subsequent repetitive events based on the session and user id, leaving the duration as is.


Solution

  • library('dplyr')
    
    df %>%
      
      # group rows by Session + UserID
      group_by(Session,UserID) %>%
      
      # filter rows where Page is different from the last Page (within group), or where last Page (within group) is NA
      filter(Page != lag(Page, 1) | is.na(lag(Page, 1)) ) %>%
      
      # create new ids
      mutate(id = row_number()) %>%
      
      # remove grouping, for future use
      ungroup()
    

    Result:

    # A tibble: 10 × 5
       Session UserID Duration.min.    id Page 
         <int>  <int>         <int> <int> <chr>
     1     123    345            40     1 Pg1  
     2     123    345            40     2 Pg2  
     3     123    345            40     3 Pg3  
     4     123    345            40     4 Pg2  
     5     123    345            40     5 Pg3  
     6     124    346           100     1 Pg2  
     7     124    346           100     2 Pg3  
     8     124    346           100     3 Pg4  
     9     125    347            59     1 Pg1  
    10     125    347            59     2 Pg5