Tags: r, for-loop, lapply, mclapply

Replace nested for-loops on a multidimensional array with mclapply()


I'm trying to perform an operation over a 4-dimensional array. The array ends up being very large, but that is necessary for the data I'm processing. The process itself works fine, but I want to make it ready for parallel computing. I have access to a 96-core mainframe and I want to use it.

So far I've read online that the easiest way to get this done is with mclapply(), the parallelized version of lapply(). I know the basics of how lapply() works, but I can't quite figure out how to apply it in this situation.

I have a 4-dimensional array that's filled with NAs. Each dimension has dimnames. I want to compare the dimnames of dimension 1 with those of dimension 3, and the dimnames of dimension 2 with those of dimension 4 (this is done by a custom function that I wrote). If they all match up, a number comes out, and I want that number to be entered into xy[i, k, j, l], where the letters i to l are the indices of one entry.

In the example below I have simplified the comparison into a sum of the nchar() values of the dimnames.

xy <- array(NA, dim = c(10, 10, 10, 10), dimnames = list(c("john", "sandra", "peter", "linda", "max", "sam", "ana", "enzo", "juan", "abe"), 
                                                          c("smith", "gonzalez", "doe", "dopi", "lincoln", "biden", "rutte", "merkel", "slim", "shady"),
                                                          c("jon", "sam", "pete", "melinda", "max", "sam", "anna", "carlo", "jiro", "abel"),
                                                          c("smitty", "rupinder", "dole", "mite", "lincolan", "bidet", "rourke", "meer", "smart", "sunny")))

for (i in seq_len(dim(xy)[1])) {
  for (j in seq_len(dim(xy)[3])) {
    for (k in seq_len(dim(xy)[2])) {
      for (l in seq_len(dim(xy)[4])) {
        # i and j index dimensions 1 and 3; k and l index dimensions 2 and 4
        a <- nchar(dimnames(xy)[[1]][i]) + nchar(dimnames(xy)[[3]][j])
        b <- nchar(dimnames(xy)[[2]][k]) + nchar(dimnames(xy)[[4]][l])
        if (!is.null(a) && !is.null(b)) {
          xy[i, k, j, l] <- a + b
        }
      }
    }
  }
}

My problem is that my output needs to be a multidimensional array. So far I've only used lapply() to output a single list of values. How do I extend this to multiple dimensions?

I've already looked in these posts:

replace a nested for loop with mapply

replace nested foreach loops

but each of these solves its problem in a way that does not carry over to mine.


Solution

    fun_on_names <- function(Var1, Var2, Var3, Var4) {
      # Var1..Var4 are single names taken from dimensions 1..4 of xy
      a <- nchar(Var1) + nchar(Var3)
      b <- nchar(Var2) + nchar(Var4)

      if (!is.null(a) && !is.null(b)) a + b else NA
    }

    # expand.grid() yields one row per cell of xy, in the array's storage order
    xy[] <- do.call(parallel::mcmapply,
                    c(list(FUN = fun_on_names, mc.cores = 96),
                      expand.grid(dimnames(xy), stringsAsFactors = FALSE)))
    

    The idea is:

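    mcmapply() returns a plain numeric vector with one value per row of the expand.grid() output. Assigning with xy[] <- replaces only the values of xy and keeps its attributes (dim and dimnames) intact, so the result is still a 4-dimensional array. The values land in the right cells because expand.grid() varies its first variable fastest, which is the same order in which R stores array elements.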
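The effect of the empty brackets can be seen on a small case. The sketch below uses a made-up 2 x 3 array `m`, and base mapply() stands in for mcmapply() so that it runs on any platform; none of these names come from the question.

```r
# Hypothetical toy array with dimnames, filled with NA like xy
m <- array(NA, dim = c(2, 3),
           dimnames = list(c("a", "bb"), c("x", "yy", "zzz")))

# One row per cell of m; the first column varies fastest
grid <- expand.grid(dimnames(m), stringsAsFactors = FALSE)

# mapply() returns a flat vector with no dim attribute
vals <- mapply(function(Var1, Var2) nchar(Var1) + nchar(Var2),
               grid[[1]], grid[[2]], USE.NAMES = FALSE)

# Empty brackets: only the values are replaced, dim and dimnames survive
m[] <- vals

m["bb", "zzz"]   # nchar("bb") + nchar("zzz")
```

Writing `m <- vals` instead would overwrite the array with the flat vector and lose both dim and dimnames.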

    This solution does not run in parallel on Windows: mcmapply() relies on forking, which Windows does not support, so mc.cores must be 1 there.
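If Windows support matters, one possible alternative is a socket cluster with parallel::clusterMap(), which does not depend on forking. This is only a sketch under assumptions: the small array, its names, and the worker count of 2 are made up for illustration, and the simplified function omits the is.null() check.

```r
library(parallel)

# Small stand-in for xy; the names are illustrative only
xy <- array(NA_real_, dim = c(2, 3, 2, 3),
            dimnames = list(c("john", "sandra"),
                            c("smith", "doe", "biden"),
                            c("jon", "sam"),
                            c("smitty", "dole", "meer")))

fun_on_names <- function(Var1, Var2, Var3, Var4) {
  (nchar(Var1) + nchar(Var3)) + (nchar(Var2) + nchar(Var4))
}

grid <- expand.grid(dimnames(xy), stringsAsFactors = FALSE)

# Two workers keep the sketch cheap; a real run would use more
cl <- makeCluster(2)
res <- clusterMap(cl, fun_on_names,
                  grid[[1]], grid[[2]], grid[[3]], grid[[4]],
                  SIMPLIFY = TRUE, USE.NAMES = FALSE)
stopCluster(cl)

# Same trick as before: keep dim and dimnames, replace the values
xy[] <- res
```

Socket workers start as fresh R sessions, so a real custom comparison function that depends on other objects or packages would need those exported to the workers first, for example with clusterExport().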

    do.call() is needed so that each column of the data.frame returned by expand.grid() is passed to mcmapply() as a separate vector argument.

    You can see it as:

    df <- expand.grid(dimnames(xy), stringsAsFactors = FALSE)
    xy[] <- parallel::mcmapply(FUN = fun_on_names, 
                               mc.cores = 96,
                               df[[1]], df[[2]], df[[3]], df[[4]])
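The equivalence of the do.call() form and the written-out form can be checked on a small case. In this sketch base mapply() stands in for mcmapply() so that it runs on any platform, and `f` and `df` are made-up names rather than anything from the question.

```r
f <- function(Var1, Var2) nchar(Var1) + nchar(Var2)

# An unnamed list gives columns named Var1 and Var2, matching f's arguments
df <- expand.grid(list(c("a", "bb"), c("x", "yy", "zzz")),
                  stringsAsFactors = FALSE)

# do.call() spreads the columns of df out as separate arguments
via_do_call <- do.call(mapply, c(list(FUN = f, USE.NAMES = FALSE), df))

# The same call with the columns passed by hand
by_hand <- mapply(FUN = f, df[[1]], df[[2]], USE.NAMES = FALSE)

identical(via_do_call, by_hand)
```

Passing `df` directly, as in `mapply(f, df)`, would not be equivalent: mapply() would then iterate over the columns themselves and hand each whole column to the first argument of `f`.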