rforeachparallel-processingnested-loopsr-caret

Parallel processing for nested loops isn't working correctly in R


I'm trying to run a nested loop using the parallel() and foreach() packages to increase speed, but I'm not having success. I have a dataframe where the first column is a categorical data and the others are scalar. I want to run a function, such as caret::bagFDA, with two predictor variables and the categorical one as the response variable, so that all possible combinations are tested without repetition. So, what I'm looking for is "Categorical ~ Scalar Variable 1 + Scalar Variable 2", in a nested loop that all combinations are executed without repeating.

When I attempt to run a conventional nested loop, it works fine. Here's a reproducible example:

#Install packages
#install.packages("caret")
#install.packages("dplyr")

start.time <- Sys.time()
#Libraries
library(caret)
library(dplyr)

#Read iris data
data(iris)
iris <- iris[c(ncol(iris), 1:(ncol(iris) - 1))]

#Sampling 50/50 for train and valid
set.seed(2000)
Train <- iris %>% group_by(Species) %>% sample_frac(.5, replace = FALSE)
Valid <- anti_join(iris, Train)


# bagFDA
BAGFDA <- data.frame(Variable_Name1 = character(0), Variable_Name2 = character(0), Accuracy = numeric(0), Kappa = numeric(0))

set.seed(3000)
for(i in 2:(ncol(Train) - 1)) {
  for (j in (i + 1):ncol(Train)) {
    tryCatch({
      formula <- as.formula(paste("as.factor(Species) ~", names(Train)[i], "+", names(Train)[j]))
      bag <- caret::bagFDA(formula, data = Train)
      bag_predict <- predict(bag, newdata = Valid)
      bag_CM <- confusionMatrix(bag_predict, Valid$Species)
      iteration_results <- data.frame(
        Variable_Name1 = names(Train)[i],
        Variable_Name2 = names(Train)[j],
        Accuracy = bag_CM$overall["Accuracy"],
        Kappa = bag_CM$overall["Kappa"]
      )
      BAGFDA <- rbind(BAGFDA, iteration_results)
      print("Good")
    }, error = function(e) {
      cat("ERROR:", conditionMessage(e), "\n")
    })
  }
}
print(BAGFDA)
end.time <- Sys.time()
time.taken <- round(end.time - start.time,2)
time.taken

I'm not sure if this is the most optimized way to run the code, but it worked fine and took about 4 seconds. However, when I try to run it in parallel with the parallel() and foreach() packages, I can't achieve the same result. Here's the code I'm trying:

#Install packages
#install.packages("foreach")
#install.packages("doParallel")
#install.packages("caret")
#install.packages("dplyr")

start.time <- Sys.time()
#Libraries
library(foreach)
library(doParallel)
library(caret)
library(dplyr)

#Read iris data
data(iris)
iris <- iris[c(ncol(iris), 1:(ncol(iris) - 1))]

#Creating clusters
cores = parallel::detectCores() - 1
cluster = parallel::makeCluster(cores, type = "PSOCK")
doParallel::registerDoParallel(cluster)
if (!foreach::getDoParRegistered()) 
{ 
  print("ERROR")
}
print(foreach::getDoParWorkers()) 

#Sampling 50/50 for train and valid
set.seed(2000)
Train <- iris %>% group_by(Species) %>% sample_frac(.5, replace = FALSE)
Valid <- anti_join(iris, Train)

# bagFDA
BAGFDA <- data.frame(Variable_Name1 = character(0), Variable_Name2 = character(0), Accuracy = numeric(0), Kappa = numeric(0))

set.seed(1001)
results <- foreach(i = 2:(ncol(Train) - 1), .combine='cbind') %:%
  foreach (j = (i + 1):ncol(Train), .combine='c') %dopar%
    tryCatch({
      formula <- as.formula(paste("as.factor(Species) ~", names(Train)[i], "+", names(Train)[j]))
      bag <- caret::bagFDA(formula, data = Train)
      bag_predict <- predict(bag, newdata = Valid)
      bag_CM <- confusionMatrix(bag_predict, Valid$Species)
      iteration_results <- data.frame(
        Variable_Name1 = names(Train)[i],
        Variable_Name2 = names(Train)[j],
        Accuracy = bag_CM$overall["Accuracy"],
        Kappa = bag_CM$overall["Kappa"]
      )
      BAGFDA <- rbind(BAGFDA, iteration_results)
      print("Good")
    }, error = function(e) {
      cat("ERROR:", conditionMessage(e), "\n")
    })

print(BAGFDA)
stopCluster(cluster)
end.time <- Sys.time()
time.taken <- round(end.time - start.time,2)
time.taken

The run doesn't show any errors, but the dataframe BAGFDA isn't filled, the results is always NULL and the formula is always (unknown). Clearly, I'm doing something wrong, but I can't identify the error. Could you please help me? I'm following the guidelines provided in this document: https://cran.r-project.org/web/packages/foreach/vignettes/nested.html


Solution

  • Try this:

    results <- foreach(i = 2:(ncol(Train) - 1), .combine=rbind) %:%
      foreach (j = (i + 1):ncol(Train), .combine=rbind, .packages = c("caret")) %dopar% {
        tryCatch({
          formula <- as.formula(paste("Species ~", names(Train)[i], "+", names(Train)[j]))
          bag <- caret::bagFDA(formula, data = Train)
          bag_predict <- predict(bag, newdata = Valid)
          bag_CM <- confusionMatrix(bag_predict, Valid$Species)
          
          data.frame(
            Variable_Name1 = names(Train)[i],
            Variable_Name2 = names(Train)[j],
            Accuracy = bag_CM$overall["Accuracy"],
            Kappa = bag_CM$overall["Kappa"]
          )
        }, error = function(e) {
          cat("ERROR:", conditionMessage(e), "\n")
        })
      }
    

    I believe that the issue you were encountering was caused by the caret package not being available on the workers. After fiddling around with the original code I got this error:

    Error in { : task 1 failed - "could not find function "confusionMatrix""
    

    that suggested this might be the problem.

    Passing the .packages = c("caret") argument ensures that the package is available to the works.

    The result of the above code is:

               Variable_Name1 Variable_Name2  Accuracy     Kappa
    Accuracy     Sepal.Length    Sepal.Width 0.7837838 0.6755275
    Accuracy1    Sepal.Length   Petal.Length 0.9729730 0.9594521
    Accuracy2    Sepal.Length    Petal.Width 0.9594595 0.9391614
    Accuracy3     Sepal.Width   Petal.Length 0.9729730 0.9594521
    Accuracy11    Sepal.Width    Petal.Width 0.9594595 0.9391614
    Accuracy4    Petal.Length    Petal.Width 0.9594595 0.9391614
    

    Also note that I'm using .combine=rbind in both foreach() commands. This means that you don't need to create the results data frame (BAGFDA) prior to entering foreach() and also avoids calling rbind() within the loops which, as pointed out already in the comments, can have a performance impact if you are working with larger data (presuming that you might be applying this to something other than iris).

    One thing to be aware of is that the set.seed(1001) might not be having the intended effect. Run the following a few times and you'll see that you get different random numbers on each run. The reason being that the seed set in the master will not propagate to each of the workers.

    set.seed(1001)
    foreach(i = 2:(ncol(Train) - 1)) %:%
      foreach (j = (i + 1):ncol(Train), .packages = c("caret")) %dopar% {
        runif(3)
      }