rclickstream

How to create user paths from clickstream data


I have some clickstream data I'd like to attribution analyze in a particular way, but I need to get into a specific format for users that convert and those that don't.

Reprex data:

df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 
                             2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001", 
                                                                                                   "1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set", 
                                                                                                   "Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"), 
                 Date_time = structure(c(1540103940, 1540104060, 1540104240, 
                                         1540318080, 1540318680, 1540318859, 1540314360, 1540413060, 
                                         1540413240, 1540538460, 1540538640, 1540629660, 1540755060, 
                                         1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                 Source = c("Facebook", "Facebook", "Facebook", "Google", 
                            "Email", "Google", "Email", "Referral", "Referral", "Facebook", 
                            "Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0, 
                                                                                                  0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df", 
                                                                                                                                                        "tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
                                                                                                                                                          cols = list(User_ID = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                            "collector")), Session_ID = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                    "collector")), Date_time = structure(list(format = ""), class = c("collector_datetime", 
                                                                                                                                                                                                                                                                                                                                      "collector")), Source = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                                                                                                                                          "collector")), Conversion = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                  "collector"))), default = structure(list(), class = c("collector_guess", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "collector")), skip = 1), class = "col_spec"))

Then set classes:

df <- df %>% 
  mutate(User_ID    = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time  = as.POSIXct(Date_time)
         )

I'd like to get all user visit paths to purchase, or total paths for ones that do not lead to purchase.

The format for the new column path would be for example: Facebook > Facebook > Facebook > Email > Email for user 2001 which I know how to achieve using mutate(path = paste0(source, collapse = " > "))

The complications are:

Each row would be either:

For the above reprex the result would look like below:

# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
    <dbl> <chr>      <dttm>                   <dbl> <chr>                                         
1    2001 1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2    2002 Not Set    2018-10-24 20:34:00          1 Google > Google > Referral > Referral         
3    2003 Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4    2002 1009       2018-10-28 19:34:00          0 Referral > Direct                             
5    2001 Not Set    2018-10-29 08:50:00          1 Google > Direct     

... where:


Solution

  • Here's an approach using dplyr:

    df2 <- df %>%
      # Add a column to distinguish between known and unknown sessions
      mutate(known_session = Session_ID != "Not Set") %>%
    
      # For each user, split between know and unknown sessions...
      group_by(User_ID, known_session) %>%
      # Sort first by Session ID, then time
      arrange(Session_ID, Date_time) %>%
      # Track which # path they're on. Start with path #1; 
      #   new path if prior event was a conversion
      mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
    
      # Label path journey by combining everything so far
      mutate(Path = paste0(Source, collapse = " > ")) %>%
      # Just keep last step in each path
      filter(row_number() == n()) %>%
      ungroup() %>%
    
      # Tidying up with just the desired columns, chronological
      select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
      arrange(Date_time)
    

    I get slightly different results, but I think they correspond to the sample data as provided:

    > df2
    # A tibble: 5 x 5
      User_ID Session_ID Date_time      
    
         Conversion Path                                          
      <fct>   <fct>      <dttm>                   <dbl> <chr>                                         
    1 2001    1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
    2 2002    Not Set    2018-10-24 20:34:00          1 Referral > Referral                           
    3 2003    Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
    4 2002    1009       2018-10-28 19:34:00          0 Google > Google > Referral > Direct           
    5 2001    Not Set    2018-10-29 08:50:00          1 Google > Direct