r parallel-processing mclapply

How to write efficient nested functions for parallelization?


I have a data frame with two grouping variables, class and group. For each class, I have one plotting task per group. Typically, class has 2 levels and group has 500 levels.

I'm using the parallel package for parallelization and the mclapply function to iterate over the class and group levels.

I'm wondering what the best way to write my iterations is. I think I have two options:

  1. Parallelize over the class variable.
  2. Parallelize over the group variable.

My computer has 4 cores; I use 3 for the R session and usually reserve the 4th for the operating system. If I parallelize over the class variable, which has only 2 levels, the 3rd core will never be used, so I thought it would be more efficient to parallelize over the group variable and keep all 3 cores busy. I wrote a speed test to check which way is better:

library(microbenchmark)
library(parallel)

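# A = cores for the outer (class) loop, B = cores for the inner (group) loop
f = function(class, group, A, B) {

  mclapply(seq_len(class), mc.cores = A, function(z) {
    mclapply(seq_len(group), mc.cores = B, function(g) {
      # branch on the current class level z, not on the total count `class`
      ifelse(z == 1, 'plotA', 'plotB')
    })
  })

}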

class = 2
group = 500

microbenchmark(
  up = f(class, group, 3, 1),
  nest = f(class, group, 1, 3),
  times = 50L
)

Unit: milliseconds
 expr       min        lq     mean    median       uq      max neval
   up  6.751193  7.897118 10.89985  9.769894 12.26880 26.87811    50
 nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878    50

The result suggests that I should parallelize over the class variable rather than the group variable.

The takeaway would be that I should always write single-core functions and then call them in parallel. I think my code would be simpler this way than if I wrote nested functions with built-in parallelization.
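
A minimal sketch of that idea, under stated assumptions: `plot_one` is a hypothetical single-core worker standing in for the real plotting code, and the class and group levels are taken to be the integers 1:2 and 1:500. The class/group combinations are flattened into one task table so that a single mclapply call can spread them over the cores. Note that mclapply relies on forking, so on Windows mc.cores must be 1.

```r
library(parallel)

# Hypothetical single-core worker: handles exactly one class/group pair.
plot_one <- function(class_id, group_id) {
  # Placeholder for the real plotting code.
  paste0(if (class_id == 1) "plotA" else "plotB", "_group", group_id)
}

# One row per class/group combination (2 x 500 = 1000 tasks).
tasks <- expand.grid(class_id = 1:2, group_id = 1:500)

# A single, flat mclapply call; mc.cores = 3 mirrors the setup described above.
results <- mclapply(
  seq_len(nrow(tasks)),
  function(i) plot_one(tasks$class_id[i], tasks$group_id[i]),
  mc.cores = 3
)
```

Because the worker knows nothing about parallelism, it can be tested on a single pair first and then handed to lapply or mclapply unchanged.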

The ifelse condition is there because the code that prepares the data for plotting is nearly identical for both class levels, so I thought it would take fewer lines to write one longer function that checks the class level than to split it into two shorter functions.
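
One way to picture a single function with shared preparation and a class-specific final step is the sketch below. The function `make_plot`, the column `x`, and the returned list are all hypothetical placeholders, not the real plotting code. It uses switch rather than ifelse because the class level is a single value, and ifelse is intended for vectorised tests.

```r
# Hypothetical sketch: shared preparation, class-specific final step.
make_plot <- function(data, class_id) {
  # Shared preparation (stand-in for the real, nearly identical prep code).
  prepared <- data[order(data$x), , drop = FALSE]

  # Only the class-specific part branches; the unnamed last argument
  # is the fallback for an unexpected class level.
  switch(
    as.character(class_id),
    "1" = list(type = "plotA", n = nrow(prepared)),
    "2" = list(type = "plotB", n = nrow(prepared)),
    stop("unknown class: ", class_id)
  )
}

toy <- data.frame(x = c(3, 1, 2))
res_a <- make_plot(toy, 1)
res_b <- make_plot(toy, 2)
```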

What is the best practice for writing this kind of code? It seems clear, but since I'm not an expert data scientist, I would like to know how you approach it.

This thread is about this problem. But I think that my question covers both points of view:

Thanks


Solution

  • You asked this a while ago, but I'll attempt an answer in case anyone else is wondering the same thing. First, I like to split up the task and then loop over each part. This gives me more control over the process.

    # list(), not c(): one piece per class/group combination
    parts <- split(df, list(df$class, df$group), drop = TRUE)
    mclapply(parts, some_function, mc.cores = 3)
    

    Second, distributing tasks to multiple cores carries a lot of computational overhead, which can cancel out any gains you make from parallelizing your script. Here, mclapply divides the parts among the available cores and forks only once per core. This is much more efficient than nesting two mclapply loops.
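
To make the split-then-loop approach concrete, here is a small sketch on toy data. The data frame, its `value` and `rep` columns, and the `summarise_part` function are assumptions made up for illustration; the real `some_function` would do the plotting instead.

```r
library(parallel)

# Toy data: 2 classes x 5 groups, 3 rows each (the real data has 500 groups).
df <- expand.grid(class = c("a", "b"), group = 1:5, rep = 1:3)
df$value <- seq_len(nrow(df))

# One piece per class/group pair; drop = TRUE discards empty combinations.
parts <- split(df, list(df$class, df$group), drop = TRUE)

# Hypothetical stand-in for the per-piece plotting function.
summarise_part <- function(part) {
  data.frame(class = part$class[1], group = part$group[1], n = nrow(part))
}

# With the default mc.preschedule = TRUE, the pieces are divided among
# mc.cores forked workers up front instead of forking once per piece.
out <- mclapply(parts, summarise_part, mc.cores = 3)
```

The result is a list with one element per class/group pair, named after the pieces, so individual results can be looked up by name afterwards.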