I have a clickstream .csv
data file in this format :
Step_Name Session_ID
p-1 1
p-2 1
defer 1
p-1 2
p-3 2
buy 2
p-4 3
defer 3
I am trying to use clickstream
package in R for analysis. This package requires us to convert the above data into this format :
1 p-1 p-2 defer
2 p-1 p-3 buy
3 p-4 defer
where the first column is Session ID and number of columns may vary for each session depending on the number of steps the website user has taken. How do I get the initial data into the above format (in either excel or R)?
Also, if there are other ways to analyse clickstream data in R without such data manipulation, please let me know.
Using data.table, load your data:
> library(data.table)
> df = data.table('Step_Name' = c('p-1', 'p-2', 'defer','p-1','p-3','buy','p-4','defer'), 'Session_ID' = c(rep(1,3),rep(2,3),rep(3,2)))
> df
Step_Name Session_ID
1: p-1 1
2: p-2 1
3: defer 1
4: p-1 2
5: p-3 2
6: buy 2
7: p-4 3
8: defer 3
Add a row number ID, restarting for each new Session_ID
.
> df[, id := seq_len(.N), by = Session_ID]
Cast your data table:
> df_final = dcast(df, Session_ID ~ id, value.var = 'Step_Name')
And your desired result:
> df_final
Session_ID 1 2 3
1: 1 p-1 p-2 defer
2: 2 p-1 p-3 buy
3: 3 p-4 defer <NA>
NB: This relies on your data being in order within Session_ID
.