I'm trying to run a nested loop using the parallel() and foreach() packages to increase speed, but I'm not having success. I have a dataframe where the first column is a categorical data and the others are scalar. I want to run a function, such as caret::bagFDA, with two predictor variables and the categorical one as the response variable, so that all possible combinations are tested without repetition. So, what I'm looking for is "Categorical ~ Scalar Variable 1 + Scalar Variable 2", in a nested loop that all combinations are executed without repeating.
When I attempt to run a conventional nested loop, it works fine. Here's a reproducible example:
#Install packages
#install.packages("caret")
#install.packages("dplyr")
start.time <- Sys.time()
#Libraries
library(caret)
library(dplyr)
#Read iris data
data(iris)
iris <- iris[c(ncol(iris), 1:(ncol(iris) - 1))]
#Sampling 50/50 for train and valid
set.seed(2000)
Train <- iris %>% group_by(Species) %>% sample_frac(.5, replace = FALSE)
Valid <- anti_join(iris, Train)
# bagFDA
BAGFDA <- data.frame(Variable_Name1 = character(0), Variable_Name2 = character(0), Accuracy = numeric(0), Kappa = numeric(0))
set.seed(3000)
for(i in 2:(ncol(Train) - 1)) {
for (j in (i + 1):ncol(Train)) {
tryCatch({
formula <- as.formula(paste("as.factor(Species) ~", names(Train)[i], "+", names(Train)[j]))
bag <- caret::bagFDA(formula, data = Train)
bag_predict <- predict(bag, newdata = Valid)
bag_CM <- confusionMatrix(bag_predict, Valid$Species)
iteration_results <- data.frame(
Variable_Name1 = names(Train)[i],
Variable_Name2 = names(Train)[j],
Accuracy = bag_CM$overall["Accuracy"],
Kappa = bag_CM$overall["Kappa"]
)
BAGFDA <- rbind(BAGFDA, iteration_results)
print("Good")
}, error = function(e) {
cat("ERROR:", conditionMessage(e), "\n")
})
}
}
print(BAGFDA)
end.time <- Sys.time()
time.taken <- round(end.time - start.time,2)
time.taken
I'm not sure if this is the most optimized way to run the code, but it worked fine and took about 4 seconds. However, when I try to run it in parallel with the parallel() and foreach() packages, I can't achieve the same result. Here's the code I'm trying:
#Install packages
#install.packages("foreach")
#install.packages("doParallel")
#install.packages("caret")
#install.packages("dplyr")
start.time <- Sys.time()
#Libraries
library(foreach)
library(doParallel)
library(caret)
library(dplyr)
#Read iris data
data(iris)
iris <- iris[c(ncol(iris), 1:(ncol(iris) - 1))]
#Creating clusters
cores = parallel::detectCores() - 1
cluster = parallel::makeCluster(cores, type = "PSOCK")
doParallel::registerDoParallel(cluster)
if (!foreach::getDoParRegistered())
{
print("ERROR")
}
print(foreach::getDoParWorkers())
#Sampling 50/50 for train and valid
set.seed(2000)
Train <- iris %>% group_by(Species) %>% sample_frac(.5, replace = FALSE)
Valid <- anti_join(iris, Train)
# bagFDA
BAGFDA <- data.frame(Variable_Name1 = character(0), Variable_Name2 = character(0), Accuracy = numeric(0), Kappa = numeric(0))
set.seed(1001)
results <- foreach(i = 2:(ncol(Train) - 1), .combine='cbind') %:%
foreach (j = (i + 1):ncol(Train), .combine='c') %dopar%
tryCatch({
formula <- as.formula(paste("as.factor(Species) ~", names(Train)[i], "+", names(Train)[j]))
bag <- caret::bagFDA(formula, data = Train)
bag_predict <- predict(bag, newdata = Valid)
bag_CM <- confusionMatrix(bag_predict, Valid$Species)
iteration_results <- data.frame(
Variable_Name1 = names(Train)[i],
Variable_Name2 = names(Train)[j],
Accuracy = bag_CM$overall["Accuracy"],
Kappa = bag_CM$overall["Kappa"]
)
BAGFDA <- rbind(BAGFDA, iteration_results)
print("Good")
}, error = function(e) {
cat("ERROR:", conditionMessage(e), "\n")
})
print(BAGFDA)
stopCluster(cluster)
end.time <- Sys.time()
time.taken <- round(end.time - start.time,2)
time.taken
The run doesn't show any errors, but the dataframe BAGFDA isn't filled, the results is always NULL and the formula is always (unknown). Clearly, I'm doing something wrong, but I can't identify the error. Could you please help me? I'm following the guidelines provided in this document: https://cran.r-project.org/web/packages/foreach/vignettes/nested.html
Try this:
results <- foreach(i = 2:(ncol(Train) - 1), .combine=rbind) %:%
foreach (j = (i + 1):ncol(Train), .combine=rbind, .packages = c("caret")) %dopar% {
tryCatch({
formula <- as.formula(paste("Species ~", names(Train)[i], "+", names(Train)[j]))
bag <- caret::bagFDA(formula, data = Train)
bag_predict <- predict(bag, newdata = Valid)
bag_CM <- confusionMatrix(bag_predict, Valid$Species)
data.frame(
Variable_Name1 = names(Train)[i],
Variable_Name2 = names(Train)[j],
Accuracy = bag_CM$overall["Accuracy"],
Kappa = bag_CM$overall["Kappa"]
)
}, error = function(e) {
cat("ERROR:", conditionMessage(e), "\n")
})
}
I believe that the issue you were encountering was caused by the caret
package not being available on the workers. After fiddling around with the original code I got this error:
Error in { : task 1 failed - "could not find function "confusionMatrix""
that suggested this might be the problem.
Passing the .packages = c("caret")
argument ensures that the package is available to the works.
The result of the above code is:
Variable_Name1 Variable_Name2 Accuracy Kappa
Accuracy Sepal.Length Sepal.Width 0.7837838 0.6755275
Accuracy1 Sepal.Length Petal.Length 0.9729730 0.9594521
Accuracy2 Sepal.Length Petal.Width 0.9594595 0.9391614
Accuracy3 Sepal.Width Petal.Length 0.9729730 0.9594521
Accuracy11 Sepal.Width Petal.Width 0.9594595 0.9391614
Accuracy4 Petal.Length Petal.Width 0.9594595 0.9391614
Also note that I'm using .combine=rbind
in both foreach()
commands. This means that you don't need to create the results data frame (BAGFDA
) prior to entering foreach()
and also avoids calling rbind()
within the loops which, as pointed out already in the comments, can have a performance impact if you are working with larger data (presuming that you might be applying this to something other than iris
).
One thing to be aware of is that the set.seed(1001)
might not be having the intended effect. Run the following a few times and you'll see that you get different random numbers on each run. The reason being that the seed set in the master will not propagate to each of the workers.
set.seed(1001)
foreach(i = 2:(ncol(Train) - 1)) %:%
foreach (j = (i + 1):ncol(Train), .packages = c("caret")) %dopar% {
runif(3)
}