When doing frequent sequence mining, one typically wants to do the following tasks:
1. Find sequential patterns (frequent sequences).
2. Find out which sequential patterns apply to a transaction. That is, given a transaction, which of the frequent sequences found are present in it?
I'm having trouble doing the latter.
Using R, I am applying the cspade algorithm from the arulesSequences package to the following toy dataset:
data <- data.frame(id = 1:10,
                   transaction = c("A B B A",
                                   "A B C B D C B B B F A",
                                   "A A B",
                                   "B A B A",
                                   "A B B B B",
                                   "A A A B",
                                   "A B B A B B",
                                   "E F F A C B D A B C D E",
                                   "A B B A B",
                                   "A B"))
Then I split the data using the str_split function from the stringr package:
library(stringr)
data_for_fseq_mining <- str_split(string = data$transaction, pattern = " ")
Next, I name the list elements in 'data_for_fseq_mining' with their identifiers. This is a prerequisite for using the function 'as.transactions' as shown below.
names(data_for_fseq_mining) <- data$id
To convert this data to an object of class 'transactions', I use the as.transactions function from https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
Now that the data is in the proper format, I run the cspade algorithm with the following parameters:
library(arulesSequences)
sequences <- cspade(data = data_for_fseq_mining_trans,
                    parameter = list(support = 0.3, maxsize = 10, maxlen = 10, mingap = 1, maxgap = 10),
                    control = list(tidList = TRUE, verbose = TRUE))
Summarizing the results (sequence and relative support):
sequences_df <- cbind(sequence = labels(sequences), support = sequences@quality)
sequence support
1 <{A}> 1.0
2 <{B}> 1.0
3 <{A},{B}> 1.0
4 <{B},{B}> 0.7
5 <{A},{B},{B}> 0.6
6 <{B},{B},{B}> 0.4
7 <{A},{B},{B},{B}> 0.4
8 <{B},{B},{B},{B}> 0.3
9 <{A},{B},{B},{B},{B}> 0.3
10 <{A},{A},{B}> 0.5
11 <{B},{A},{B}> 0.4
12 <{A},{B},{A},{B}> 0.3
13 <{A},{A}> 0.8
14 <{B},{A}> 0.6
15 <{A},{B},{A}> 0.6
16 <{B},{B},{A}> 0.5
17 <{A},{B},{B},{A}> 0.4
That's perfectly fine, but now I would like to know, for each transaction, whether each sequence is present or not (TRUE/FALSE). To do this, I tried to use the tidList:
sequences_score <- as.matrix(sequences@tidLists@data)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[9,] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
From this result, I assume each row corresponds to a transaction and each column to a sequence. But the 4th column says that pattern '<{B},{B}>' is not present in transactions 2, 4 and 7, even though these transactions clearly all contain it. Are my assumptions about the output wrong?
An alternative approach is to use the code provided by juliesls in the question "R arulesSequences Find which patterns are supported by a sequence".
However, applying the following lines of code produces an error:
ids <- unique(data_for_fseq_mining_trans@itemsetInfo$sequenceID)
sequences_score <- data.frame()
for (seq_id in 1:length(sequences)){
  sequences_score[,labels(sequences[seq_id])] <- logical(0)
}
for (id in ids){
  transaction_subset <- data_for_fseq_mining_trans[data_for_fseq_mining_trans@itemsetInfo$sequenceID==id]
  sequences_score[id, ] <- as.logical(support(x = sequences, transactions = transaction_subset, type="absolute"))
}
Any clues?
To see whether each sequence is present, you can indeed use the code you provided:
sequences_score <- as.matrix(sequences@tidLists@data)
However, the rows of this matrix are not in the order of your data. They follow the sequence IDs stored in another slot of your sequences object, so you have to map the matrix back to your data as follows:
# Get mapping ids, change to numeric values
mapping_ids <- as.numeric(sequences@tidLists@transactionInfo$sequenceID)
# Then reorder the rows of sequences_score to match the order of your data
sequences_score <- sequences_score[order(mapping_ids), ]
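To illustrate why the reordering matters, here is a minimal base-R sketch that needs no packages. It assumes the sequence IDs were stored as character strings, so the rows come back in lexicographic order ("1", "10", "2", ...); that is an assumption inferred from the printed matrix, not something confirmed by the package documentation. The vector `bb_present` is the 4th column (<{B},{B}>) copied from the matrix in the question, and `contains_subsequence` is a hypothetical helper written only for this cross-check. It ignores the mingap/maxgap constraints, which do not affect this toy data.

```r
# Assumption: IDs are character strings, so rows are ordered "1", "10", "2", ..., "9".
row_ids <- sort(as.character(1:10), method = "radix")

# Column 4 (<{B},{B}>) of the matrix printed in the question, in its original row order.
bb_present <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
names(bb_present) <- row_ids

# Same idea as the reordering above: sort the rows by numeric sequence ID.
bb_by_id <- bb_present[order(as.numeric(row_ids))]

# Hypothetical helper (not part of arulesSequences): TRUE if `pattern` occurs in
# `items` in order, not necessarily contiguously. Gap constraints are ignored.
contains_subsequence <- function(pattern, items) {
  pos <- 0L
  for (p in pattern) {
    hits <- which(items == p)
    hits <- hits[hits > pos]          # only matches after the previous match
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]                   # greedy: take the earliest possible match
  }
  TRUE
}

# The toy transactions from the question, in data order (ids 1 to 10).
transactions <- strsplit(c("A B B A", "A B C B D C B B B F A", "A A B", "B A B A",
                           "A B B B B", "A A A B", "A B B A B B",
                           "E F F A C B D A B C D E", "A B B A B", "A B"), " ")
direct_check <- vapply(transactions, contains_subsequence, logical(1),
                       pattern = c("B", "B"))

# Compare the reordered tidList column with the direct check.
agrees <- all(unname(bb_by_id) == direct_check)
```

Under these assumptions, `agrees` is TRUE and the only IDs without <{B},{B}> are 3, 6 and 10, which matches the data.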