r sparklyr fpgrowth

Choosing support and confidence values with ml_fpgrowth in Sparklyr


I am trying to take some inspiration from this Kaggle script, where the author uses arules to perform a market basket analysis in R. I am particularly interested in the section where they pass in vectors of confidence and support values and then plot the number of rules generated, to help choose the optimal values rather than generating a massive number of rules.

I want to try the same process, but I am using sparklyr/Spark with FP-growth in R and I am struggling to achieve the same output, i.e. a count of rules for each confidence and support value.

From the limited examples and documentation, I believe I pass my transaction data to ml_fpgrowth along with my confidence and support values. This function generates a model, which then needs to be passed to ml_association_rules to generate the rules.

# CONVERT TABLE TO TRANSACTION FORMAT
trans <- medical_tbl %>% 
  group_by(alt_claim_id) %>%
  summarise(items = collect_list(proc_cd))

# SUPPORT AND CONFIDENCE VALUES
supportLevels <- c(0.1, 0.05, 0.01, 0.005)
confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

# EMPTY LISTS
model_sup10 <- vector("list", length = 9)
model_sup5 <- vector("list", length = 9)
model_sup1 <- vector("list", length = 9)
model_sup0.5 <- vector("list", length = 9)

# FP GROWTH ALGORITHM WITH A SUPPORT LEVEL OF 10%
for (i in 1:length(confidenceLevels)) {
  model_sup10[i] <- ml_fpgrowth(trans,
                                min_support = supportLevels[1],
                                min_confidence = confidenceLevels[i],
                                items_col = "items",
                                uid = random_string("fpgrowth_"))}

I tried checking the rules for one of the models above (the first one stored in model_sup10), but I cannot extract any rules. The code below gives the following error:

rules <- ml_association_rules(model_sup10[[1]][1])
Error: $ operator is invalid for atomic vectors

Can anyone help, or explain whether this is possible with fpgrowth, and what the best way forward is to achieve my goal of plotting the number of rules generated for each support/confidence pairing?


Solution

  • After some head-banging with dplyr and sparklyr, I managed to cobble the following together. If anyone has feedback on how I can improve this code, please feel free to comment.

    library(sparklyr)
    library(dplyr)
    library(data.table)  # provides rbindlist()
    
    # SUPPORT AND CONFIDENCE VALUES
    supportLevels <- c(0.1, 0.05, 0.01, 0.005)
    confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
    
    # FIT AN FP-GROWTH MODEL FOR ONE SUPPORT/CONFIDENCE PAIR AND RETURN THE NUMBER OF RULES GENERATED
    testModelFunction <- function(i, j) {
      ml_fpgrowth(trans,
                  min_support = as.numeric(i),
                  min_confidence = as.numeric(j),
                  items_col = "items",
                  uid = random_string("fpgrowth_")) %>% 
        ml_association_rules() %>% 
        count(name = "rules") %>% 
        pull()
    }
    
    # CREATE A LIST TO STORE THE OUTPUT FROM testModelFunction
    l <- list()
    n <- 1
    
    for (i in supportLevels) {
      for (j in confidenceLevels) {
        message(paste(i, j))
        # tryCatch() returns the value of whichever branch runs, so assign its result;
        # an assignment made inside the error handler would never reach the outer list
        l[[n]] <- tryCatch(
          list(supportLevels = i, confidenceLevels = j, n_rules = testModelFunction(i, j)),
          error = function(e) {
            # store the message text so the row can still be bound into a table
            list(supportLevels = i, confidenceLevels = j, error = conditionMessage(e))
          })
        n <- n + 1
      }
    }
    
    # COMBINE THE RESULTS INTO ONE TABLE (ONE ROW PER SUPPORT/CONFIDENCE PAIR)
    rbindlist(l, fill = TRUE)
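
To get from that table to the plot I was originally after, something like the following might work. It is only a sketch: plot_rule_counts is a hypothetical helper name, it assumes the table has the supportLevels, confidenceLevels and n_rules columns built in the loop above with at least one successful row, and it uses base R graphics rather than whatever the Kaggle script used.

```r
# Hypothetical helper (not part of sparklyr): draws one line per support level,
# with confidence on the x axis and the number of rules on the y axis.
# Assumes `results` has columns supportLevels, confidenceLevels and n_rules.
plot_rule_counts <- function(results) {
  results <- as.data.frame(results)

  # Drop pairings where the model failed and no rule count was recorded
  results <- results[!is.na(results$n_rules), ]

  supports <- sort(unique(results$supportLevels), decreasing = TRUE)
  colours <- seq_along(supports)

  # Set up empty axes wide enough for every support level
  plot(results$confidenceLevels, results$n_rules,
       type = "n",
       ylim = c(0, max(results$n_rules, 1)),
       xlab = "Confidence level",
       ylab = "Number of rules found",
       main = "Rules generated per support/confidence pairing")

  # Add one line per support level, ordered by confidence
  for (k in seq_along(supports)) {
    rows <- results[results$supportLevels == supports[k], ]
    rows <- rows[order(rows$confidenceLevels), ]
    lines(rows$confidenceLevels, rows$n_rules,
          type = "b", col = colours[k], pch = 19)
  }

  legend("topright",
         legend = paste0("Support ", supports * 100, "%"),
         col = colours, lty = 1, pch = 19)

  # Return the plotted support levels invisibly
  invisible(supports)
}
```

It would be called on the combined table, e.g. plot_rule_counts(rbindlist(l, fill = TRUE)). Very low support levels can produce far more rules than the others, so a log scale on the y axis may be worth trying if the lines are hard to compare.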