I'm trying to perform an action over a 4-dimensional array. This array ends up being incredibly big, but it is necessary for the data that I'm processing. The process itself works fine, but I want to make it ready for parallel computing. I've got access to a 96-core mainframe and I want to use it.
So far I've read online that the easiest way to get this done is by using mclapply(), the parallelized version of lapply(). I know the basics of how lapply() works, but I can't quite figure out how to apply it in this situation.
I have a 4-dimensional array that's filled with NAs. Each dimension has dimnames. I want to compare the dimnames of dimension 1 with those of dimension 3, and those of dimension 2 with those of dimension 4 (this is done by a custom function that I wrote). If they all match up, a number comes out, and I want that number to be entered into xy[i, k, j, l], where the letters i through l are the indices of one entry.
In the example below I have simplified the comparison to a sum of the nchar() values of the dimnames.
xy <- array(NA, dim = c(10, 10, 10, 10),
            dimnames = list(c("john", "sandra", "peter", "linda", "max", "sam", "ana", "enzo", "juan", "abe"),
                            c("smith", "gonzalez", "doe", "dopi", "lincoln", "biden", "rutte", "merkel", "slim", "shady"),
                            c("jon", "sam", "pete", "melinda", "max", "sam", "anna", "carlo", "jiro", "abel"),
                            c("smitty", "rupinder", "dole", "mite", "lincolan", "bidet", "rourke", "meer", "smart", "sunny")))
for(i in 1:dim(xy)[1]){
  for(j in 1:dim(xy)[3]){
    for(k in 1:dim(xy)[2]){
      for(l in 1:dim(xy)[4]){
        # compare dimension 1 with 3, and dimension 2 with 4
        a <- nchar(dimnames(xy)[[1]][i]) + nchar(dimnames(xy)[[3]][j])
        b <- nchar(dimnames(xy)[[2]][k]) + nchar(dimnames(xy)[[4]][l])
        # && is the scalar operator intended for if() conditions
        if(!is.null(a) && !is.null(b)){
          xy[i, k, j, l] <- a + b
        }
      }
    }
  }
}
My problem is that my output needs to be a multidimensional array. So far I've only used lapply() to output one list of values. How do I extend this to multiple dimensions?
I've already looked in these posts:
replace a nested for loop with mapply
but each of these solves the problem in a way that does not help with mine.
fun_on_names <- function(Var1, Var2, Var3, Var4){
  a <- nchar(Var1) + nchar(Var3)
  b <- nchar(Var2) + nchar(Var4)
  # && is the scalar operator intended for if() conditions
  if(!is.null(a) && !is.null(b)) return(a + b)
  else return(NA)
}
# expand.grid() names its columns Var1..Var4, which match the arguments of fun_on_names
xy[] <- do.call(parallel::mcmapply,
                c(list(FUN = fun_on_names, mc.cores = 96),
                  expand.grid(dimnames(xy), stringsAsFactors = FALSE)))
The idea is:
expand.grid() builds a big data.frame of all the combinations of names you have.
mcmapply() applies fun_on_names() to each combination.
The result is assigned back to xy.
mcmapply() actually returns a numeric vector, but by keeping [] in xy[] <-, you assign the values back to xy while keeping the attributes of xy intact, which is what makes it a multidimensional array.
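To illustrate the point about [] keeping the attributes, here is a minimal sketch on a toy 2 x 2 x 2 x 2 array. The array xy_toy and its names are made up for illustration, and serial mapply() stands in for mcmapply() so that it runs on any platform. It also shows that the row order of expand.grid() (first column varying fastest) lines up with R's column-major array layout, so the values land in the right cells.

```r
# Toy array; the dimnames are hypothetical and only chosen to have different lengths
xy_toy <- array(NA, dim = c(2, 2, 2, 2),
                dimnames = list(c("a", "bb"), c("c", "ddd"),
                                c("e", "ff"), c("g", "hhhh")))

# One row per cell of the array, first dimension varying fastest
grid <- expand.grid(dimnames(xy_toy), stringsAsFactors = FALSE)

# Serial stand-in for mcmapply(): one value per row of grid
vals <- mapply(function(Var1, Var2, Var3, Var4) {
  nchar(Var1) + nchar(Var2) + nchar(Var3) + nchar(Var4)
}, grid$Var1, grid$Var2, grid$Var3, grid$Var4, USE.NAMES = FALSE)

with_brackets <- xy_toy
with_brackets[] <- vals      # fills the cells, keeps dim and dimnames

without_brackets <- xy_toy
without_brackets <- vals     # replaces the object with a plain vector

dim(with_brackets)                        # still a 4-dimensional array
is.null(dim(without_brackets))            # TRUE: the array structure is gone
with_brackets["bb", "ddd", "ff", "hhhh"]  # 2 + 3 + 2 + 4 = 11
```

Indexing by name in the last line is a quick way to spot-check that each value ended up in the cell whose dimnames produced it.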
This solution does not work in parallel on Windows, because mcmapply() relies on forking, which Windows does not support.
do.call() is needed so that each column of the data.frame (the output of expand.grid()) is passed to mcmapply() as an individual vector.
You can see it as:
df <- expand.grid(dimnames(xy), stringsAsFactors = FALSE)
xy[] <- parallel::mcmapply(FUN = fun_on_names,
                           mc.cores = 96,
                           df[[1]], df[[2]], df[[3]], df[[4]])
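Regarding the Windows caveat above: one possible workaround, not part of the original solution, is a socket cluster with parallel::clusterMap(), which does not depend on forking. The sketch below is only an assumption about how that could look. It uses a small stand-in array (xy_small, with made-up dimensions), a simplified fun_on_names() without the NULL check, and an arbitrary choice of 2 workers.

```r
library(parallel)

# Small stand-in array so the sketch runs quickly (hypothetical size)
xy_small <- array(NA_real_, dim = c(2, 2, 2, 2),
                  dimnames = list(c("john", "sandra"), c("smith", "doe"),
                                  c("jon", "sam"), c("smitty", "dole")))

# Simplified version of the comparison function from the answer
fun_on_names <- function(Var1, Var2, Var3, Var4) {
  nchar(Var1) + nchar(Var3) + nchar(Var2) + nchar(Var4)
}

df <- expand.grid(dimnames(xy_small), stringsAsFactors = FALSE)

cl <- makeCluster(2)  # socket cluster; the worker count is an arbitrary choice
res <- clusterMap(cl, fun_on_names,
                  df[[1]], df[[2]], df[[3]], df[[4]],
                  SIMPLIFY = TRUE, USE.NAMES = FALSE)
stopCluster(cl)       # always release the workers when done

xy_small[] <- res     # same [] trick: keeps dim and dimnames
xy_small["john", "smith", "jon", "smitty"]  # 4 + 5 + 3 + 6 = 18
```

Socket workers start as fresh R sessions, so a real comparison function that depends on other objects or packages would need them sent over first, for example with clusterExport() or clusterEvalQ().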