I have some clickstream data I'd like to attribution analyze in a particular way, but I need to get into a specific format for users that convert and those that don't.
Reprex data:
df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002,
2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set",
"Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"),
Date_time = structure(c(1540103940, 1540104060, 1540104240,
1540318080, 1540318680, 1540318859, 1540314360, 1540413060,
1540413240, 1540538460, 1540538640, 1540629660, 1540755060,
1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Source = c("Facebook", "Facebook", "Facebook", "Google",
"Email", "Google", "Email", "Referral", "Referral", "Facebook",
"Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
cols = list(User_ID = structure(list(), class = c("collector_double",
"collector")), Session_ID = structure(list(), class = c("collector_character",
"collector")), Date_time = structure(list(format = ""), class = c("collector_datetime",
"collector")), Source = structure(list(), class = c("collector_character",
"collector")), Conversion = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Then set classes:
df <- df %>%
mutate(User_ID = as.factor(User_ID),
Session_ID = as.factor(Session_ID),
Date_time = as.POSIXct(Date_time)
)
I'd like to get all user visit paths to purchase, or total paths for ones that do not lead to purchase.
The format for the new column path
would be for example: Facebook > Facebook > Facebook > Email > Email
for user 2001 which I know how to achieve using
mutate(path = paste0(source, collapse = " > "))
The complications are:
Each row would be either:
path
column would reflect the journey to
conversion - for a user's second or subsequent conversion only the
path subsequent to the previous conversion would be shown.For the above reprex the result would look like below:
# A tibble: 5 x 5
User_ID Session_ID Date_time Conversion Path
<dbl> <chr> <dttm> <dbl> <chr>
1 2001 1007 2018-10-23 17:06:00 1 Facebook > Facebook > Facebook > Email > Email
2 2002 Not Set 2018-10-24 20:34:00 1 Google > Google > Referral > Referral
3 2003 Not Set 2018-10-26 07:24:00 0 Facebook > Facebook
4 2002 1009 2018-10-28 19:34:00 0 Referral > Direct
5 2001 Not Set 2018-10-29 08:50:00 1 Google > Direct
... where:
Here's an approach using dplyr
:
df2 <- df %>%
# Add a column to distinguish between known and unknown sessions
mutate(known_session = Session_ID != "Not Set") %>%
# For each user, split between know and unknown sessions...
group_by(User_ID, known_session) %>%
# Sort first by Session ID, then time
arrange(Session_ID, Date_time) %>%
# Track which # path they're on. Start with path #1;
# new path if prior event was a conversion
mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
# Label path journey by combining everything so far
mutate(Path = paste0(Source, collapse = " > ")) %>%
# Just keep last step in each path
filter(row_number() == n()) %>%
ungroup() %>%
# Tidying up with just the desired columns, chronological
select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
arrange(Date_time)
I get slightly different results, but I think they correspond to the sample data as provided:
> df2
# A tibble: 5 x 5
User_ID Session_ID Date_time
Conversion Path
<fct> <fct> <dttm> <dbl> <chr>
1 2001 1007 2018-10-23 17:06:00 1 Facebook > Facebook > Facebook > Email > Email
2 2002 Not Set 2018-10-24 20:34:00 1 Referral > Referral
3 2003 Not Set 2018-10-26 07:24:00 0 Facebook > Facebook
4 2002 1009 2018-10-28 19:34:00 0 Google > Google > Referral > Direct
5 2001 Not Set 2018-10-29 08:50:00 1 Google > Direct