I have a dataset that looks like this:
structure(list(CATEGORY = c("Flower", "Flower", "Concentrate,Flower",
"Flower", "Flower", "Flower", "Flower", "Edible,Flower", "Concentrate,Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Edible,Flower",
"Concentrate,Flower", "Flower", "Edible,Flower", "Flower", "Edible,Flower",
"Edible,Flower", "Concentrate,Flower", "Flower", "Concentrate",
"Flower", "Edible,Flower", "Flower", "Flower", "Flower", "Concentrate",
"Edible,Flower", "Concentrate", "Flower", "Flower", "Concentrate,Flower",
"Edible,Flower", "Flower", "Flower", "Edible,Flower", "Concentrate,Flower",
"Concentrate", "Concentrate", "Concentrate", "Concentrate", "Edible,Flower",
"Flower", "Edible,Flower", "Flower", "Concentrate", "Flower",
"Concentrate,Flower", "Edible,Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Flower",
"Flower", "Flower", "Flower", "Flower", "Concentrate", "Concentrate",
"Flower", "Flower", "Flower", "Edible,Flower", "Concentrate",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Flower", "Concentrate", "Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Concentrate,Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Concentrate",
"Flower", "Concentrate", "Flower", "Flower", "Flower", "Flower",
"Edible,Flower", "Flower", "Concentrate,Flower", "Concentrate,Flower",
"Flower", "Edible,Flower", "Flower", "Flower", "Flower", "Flower",
"Concentrate,Flower", "Concentrate", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Concentrate,Flower", "Flower",
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Flower",
"Concentrate", "Concentrate,Flower", "Flower", "Flower", "Flower",
"Edible,Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Flower", "Flower", "Edible,Flower", "Concentrate", "Flower",
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower",
"Concentrate,Flower", "Flower", "Flower", "Flower", "Flower",
"Edible,Flower")), row.names = c(NA, -200L), class = c("tbl_df",
"tbl", "data.frame"))
glimpse(interesting_basket_items5)
Rows: 200
Columns: 1
$ CATEGORY <chr> "Flower", "Flower", "Concentrate,Flower", "Flower", "Flower", "Flower", "Flower", "Edible,Flower", "Concentrat…
I don't know why the following code using the arules package is not working as expected:
interesting_basket_items_list <- as(interesting_basket_items5, "transactions")
# mine association rules with the 'apriori' function
rules <- apriori(interesting_basket_items_list, parameter = list(support = 0.001, confidence = 0.05))
# sort the rules by lift
rules <- sort(rules, by = "lift")
# inspect the resulting rules
(rules <- inspect(rules))
The 'support' part of the output looks correct, as checked against my own logic here:
interesting_basket_items5 %>%
group_by(CATEGORY) %>%
tally() %>%
mutate(pct = n / sum(n))
But the confidence part doesn't make sense to me.
I would think that a rule with lhs = {Concentrate} -> rhs = {Concentrate, Flower} would have a confidence of 2/3, or .67, because I am dividing 14 by 21, that is, the number of transactions containing Concentrate and Flower by the number of transactions containing Concentrate.
In general, I can't understand why the lhs of these association rules is totally blank, like this: {}, instead of showing a more interesting antecedent. The thresholds that I am using in the code should be inclusive: parameter = list(support = 0.001, confidence = 0.05). I also think the structure of the input data, a list within a list ("transactions dataset"), is correct: {Concentrate}, {Concentrate, Flower}, etc.
Shouldn't this rules dataset have a lhs equal to Concentrate and a rhs equal to {Concentrate, Flower}, with support = 0.070 and confidence = 0.67?
I would like to understand how this works in terms of probability and conditional probability, in a way where I can demonstrate how it makes sense, start to finish, instead of feeding data blindly into the apriori function and trusting the output, so to speak.
I'd be happy with a solution that shows how to change the structure of the data passed to apriori to get a result that makes sense, or a tuning of the arguments of apriori. I would also be happy with a solution that does this in tidyverse: some way of implementing the probability that is support and the conditional probability that is confidence, in a way that is comprehensive and easy to demonstrate in the code. I'm just not certain tidyverse would easily cover all possible combinations of baskets for the conditional probability part, which is probably why the apriori algorithm was designed...
I'm able to get apriori to run, but I doubt the result is really correct for what I want. I don't know how to coerce anything to make the result more accurate or intuitive, and I also don't know how to reconstruct what apriori is doing in tidyverse, so I'm stuck. I'd appreciate an answer that brings this together.
The problem is with the data format: each row holds just one string, which apriori treats as a transaction consisting of a single item, when in fact it is a list of items joined by ",". Before feeding it to apriori, you have to split it:
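library(arules)

# Split each comma-separated string into a character vector of items
basket_split <- sapply(
  unlist(interesting_basket_items5),
  function(x) strsplit(x, ",")
)
# Build the transactions object from the list of item vectors
interesting_basket_items_list <- transactions(basket_split)
rules <- apriori(
  interesting_basket_items_list,
  parameter = list(supp = 0.001, conf = 0.05)
)
inspect(rules)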
Then the output would look right:
    lhs              rhs           support confidence coverage lift      count
[1] {}            => {Edible}      0.090   0.09000000 1.000    1.0000000  18
[2] {}            => {Concentrate} 0.175   0.17500000 1.000    1.0000000  35
[3] {}            => {Flower}      0.895   0.89500000 1.000    1.0000000 179
[4] {Edible}      => {Flower}      0.090   1.00000000 0.090    1.1173184  18
[5] {Flower}      => {Edible}      0.090   0.10055866 0.895    1.1173184  18
[6] {Concentrate} => {Flower}      0.070   0.40000000 0.175    0.4469274  14
[7] {Flower}      => {Concentrate} 0.070   0.07821229 0.895    0.4469274  14
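To see where these numbers come from in terms of probability, the support, confidence and lift columns can be recomputed by hand. The following is a minimal base R sketch, not how arules works internally. The helper names (contains, support, confidence, lift) are made up for illustration, and the baskets are rebuilt from the counts in the output above, so the 147 Flower-only baskets is a derived figure (200 - 18 - 14 - 21) rather than something stated in the question.

```r
# Rebuild the 200 baskets from the counts shown in the output above.
# Assumption: 147 Flower-only baskets, derived as 200 - 18 - 14 - 21.
baskets <- c(
  rep("Flower", 147),
  rep("Edible,Flower", 18),
  rep("Concentrate,Flower", 14),
  rep("Concentrate", 21)
)

# One character vector of items per transaction
items <- strsplit(baskets, ",")

# TRUE for each basket that contains every item in `set`
contains <- function(set) {
  vapply(items, function(b) all(set %in% b), logical(1))
}

# support(X) = P(X): share of baskets containing all items in X
support <- function(set) mean(contains(set))

# confidence(A => B) = P(B | A) = support(A and B) / support(A)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

# lift(A => B) = confidence(A => B) / support(B)
lift <- function(lhs, rhs) confidence(lhs, rhs) / support(rhs)

support(c("Concentrate", "Flower"))  # 14 / 200
confidence("Concentrate", "Flower")  # 14 / 35
lift("Concentrate", "Flower")        # (14 / 35) / (179 / 200)
```

This also explains the 0.67 in the question: the denominator of the confidence is the number of baskets that contain Concentrate at all, which is 35 (21 Concentrate-only plus 14 Concentrate,Flower), not 21. So the confidence is 14 / 35 = 0.4, matching rule [6]. As far as I know, apriori in arules only produces rules with a single item on the rhs, so the rule you were looking for shows up as {Concentrate} => {Flower}.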