I have a dataframe with two grouping variables, class and group. For each class, I have a plotting task per group. Mostly, I have 2 levels per class and 500 levels per group.
I'm using the parallel package for parallelization and the mclapply function to iterate through the class and group levels.
I'm wondering which is the best way to write my iterations. I think I have two options:
1. Run the parallelization over the class variable.
2. Run the parallelization over the group variable.
My computer has 3 cores available for the R session, and I usually reserve the 4th core for the operating system. I was thinking that if I parallelize over the class variable, which has only 2 levels, the 3rd core will never be used, so it seemed more efficient to parallelize over the group variable and keep all 3 cores busy. I've written some speed tests to check which way is best:
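library(microbenchmark)
library(parallel)

# A: cores for the outer loop over class; B: cores for the inner loop over group
f = function(class, group, A, B) {
  mclapply(seq(class), mc.cores = A, function(z) {
    mclapply(seq(group), mc.cores = B, function(g) {
      # z is the current class level
      ifelse(z == 1, 'plotA', 'plotB')
    })
  })
}

class = 2    # number of class levels
group = 500  # number of group levels

microbenchmark(
  up   = f(class, group, 3, 1),  # parallelize over class
  nest = f(class, group, 1, 3),  # parallelize over group
  times = 50L
)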
Unit: milliseconds
 expr       min        lq     mean    median       uq      max neval
   up  6.751193  7.897118 10.89985  9.769894 12.26880 26.87811    50
 nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878    50
The result suggests that I should parallelize over the class variable and not over the group variable.
My takeaway would be that I should always write single-core functions and then call them in parallel. I think this way my code would be simpler, or more reductionist, than writing nested functions with built-in parallelization.
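To make the "single-core function, then call it in parallel" idea concrete, here is a minimal sketch. The worker make_plot_label and the tasks table are hypothetical stand-ins for the real plotting code, and the core count of 3 is taken from the setup described above. mclapply relies on forking, so the sketch falls back to one core on Windows.

```r
library(parallel)

# Hypothetical single-core worker: it knows nothing about parallelism
make_plot_label <- function(class_level, group_level) {
  kind <- if (class_level == 1) "plotA" else "plotB"
  paste0(kind, "_", group_level)
}

# One row per class/group combination (2 x 500 = 1000 independent tasks)
tasks <- expand.grid(class = 1:2, group = 1:500)

# Forking is not available on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L

# A single, flat parallel loop over all combinations
plot_labels <- mclapply(seq_len(nrow(tasks)), function(i) {
  make_plot_label(tasks$class[i], tasks$group[i])
}, mc.cores = n_cores)
```

With a flat list of 1000 tasks there is no need to choose between the class loop and the group loop, and all 3 cores can be given work.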
The ifelse condition is used because the code that prepares the data for the plotting task is more or less the same for both class levels, so I thought it would take fewer lines to write one longer function that checks which class level is being processed than to split it into two shorter functions.
What is the best practice for writing this kind of code? It seems clear, but because I'm not an expert data scientist, I would like to know your working approach.
This thread is about this problem, but I think that my question applies to both points of view:
Thanks
You asked this a while ago, but I'll attempt an answer in case anyone else is wondering the same thing. First, I like to split up my task and then loop over each part. This gives me more control over the process.
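# Split on every class/group combination; list() keeps the two factors paired
parts <- split(df, list(df$class, df$group), drop = TRUE)
mclapply(parts, some_function, mc.cores = 3)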
Second, distributing tasks across multiple cores carries a lot of computational overhead, which can cancel out any gains you make from parallelizing your script. Here, mclapply splits the job across however many cores you have and performs the fork once. This is much more efficient than nesting two mclapply loops.
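As a rough, self-contained illustration of the split-then-loop approach, the sketch below runs it on made-up data. The toy df, the worker plot_one and the core count of 3 are assumptions for demonstration only, not the asker's actual data or plotting code.

```r
library(parallel)

# Hypothetical toy data: 2 classes x 5 groups, 4 rows in each combination
df <- expand.grid(class = c("a", "b"), group = 1:5, rep = 1:4)
df$value <- seq_len(nrow(df))

# Hypothetical single-core worker: receives one class/group chunk
plot_one <- function(chunk) {
  kind <- if (chunk$class[1] == "a") "plotA" else "plotB"
  paste(kind, chunk$group[1], nrow(chunk))
}

# One list element per class/group combination, named like "a.1", "b.1", ...
parts <- split(df, list(df$class, df$group), drop = TRUE)

# Forking is not available on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L

# A single parallel loop over all chunks
res <- mclapply(parts, plot_one, mc.cores = n_cores)
```

Each element of res corresponds to one class/group chunk and keeps the name that split gave it, so results can be looked up with, for example, res[["a.1"]].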