
How to convert foreach into a function?


I am using foreach to calculate correlation coefficients and p-values, with mtcars as an example (foreach is overkill here, but the dataframe I'm actually using has 450 observations of 3400 variables). I use combn to get rid of duplicate correlations and self-correlations.

combo_cars <- data.frame(t(combn(names(mtcars),2)))

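library(foreach)
# Note: %dopar% needs a registered parallel backend (e.g. via doParallel);
# without one, foreach falls back to sequential execution with a warning.
cars_res <- foreach(i = 1:nrow(combo_cars), .combine = rbind,
                    .packages = c("magrittr", "dplyr")) %dopar% {
  # One Spearman test per variable pair, tidied into a one-row data frame
  broom::tidy(cor.test(mtcars[, combo_cars[i, 1]],
                       mtcars[, combo_cars[i, 2]],
                       method = "spearman")) %>%
    mutate(Var1 = combo_cars[i, 1], Var2 = combo_cars[i, 2])
}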

I would like to convert this into a function because I want to try the future package: I need to run the correlations on subsections of the original dataframe, and it's more efficient to run them in parallel. When trying to devise a function that replicates the above, I can use:

car_res2 <- data.frame(t(combn(names(mtcars), 2, function(x)  
  cor.test(mtcars[[x[1]]],
           mtcars[[x[2]]], method="spearman"), simplify=TRUE)))

Ultimately I would like to be able to have four futures running in parallel, each computing the above on a different fraction of the dataset.

However, the car_res2 output has 8 columns instead of 7 (the second one is completely empty). I had to use the output from cars_res to work out what the values were: they are in the order statistic, blank, p-value, estimate, etc., whereas cars_res has labelled columns (estimate, statistic, p.value).

  1. Why is the output in a different order, and unlabelled, with the second approach?
  2. Can I use one of the apply functions in place of the above function?

Any comments would be appreciated.


Solution

  • Without parallelization, you can try RcppAlgos::comboGeneral first. It works very similarly to combn but is implemented in C++ and may therefore be faster. It also has a Parallel= option, but that is ignored when FUN is supplied, so the Parallel=TRUE and nThreads= arguments in the call below have no effect. I also avoid loading broom and dplyr and build the result with base R instead.

    res <- RcppAlgos::comboGeneral(names(mtcars), 2, FUN=\(x) {
      data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
    }, Parallel=TRUE, nThreads=7) |> do.call(what=rbind) |> `rownames<-`(NULL)
    
    head(res)
    #     estimate statistic      p.value                          method alternative  X1   X2
    # 1 -0.9108013 10425.332 4.690287e-13 Spearman's rank correlation rho   two.sided mpg  cyl
    # 2 -0.9088824 10414.862 6.370336e-13 Spearman's rank correlation rho   two.sided mpg disp
    # 3 -0.8946646 10337.290 5.085969e-12 Spearman's rank correlation rho   two.sided mpg   hp
    # 4  0.6514555  1901.659 5.381347e-05 Spearman's rank correlation rho   two.sided mpg drat
    # 5 -0.8864220 10292.319 1.487595e-11 Spearman's rank correlation rho   two.sided mpg   wt
    # 6  0.4669358  2908.399 7.055765e-03 Spearman's rank correlation rho   two.sided mpg qsec
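
    The index vector c(4, 1, 3, 7, 6) selects components of the list returned by cor.test by position. A quick way to check what sits at each position is sketched below; it also suggests an explanation for the empty second column in the combn approach, since the `parameter` component is NULL for Spearman. The `exact = FALSE` argument and the names `ct` and `picked` are additions for this illustration only (the former just avoids the ties warning on mtcars).

    ```r
    # Inspect the list that cor.test() returns for a single pair
    ct <- cor.test(mtcars$mpg, mtcars$cyl, method = "spearman", exact = FALSE)

    names(ct)
    # "statistic" "parameter" "p.value" "estimate"
    # "null.value" "alternative" "method" "data.name"

    # `parameter` (degrees of freedom) is not used for Spearman, so it is NULL
    is.null(ct$parameter)

    # Selecting by name picks the same components as [c(4, 1, 3, 7, 6)]
    picked <- ct[c("estimate", "statistic", "p.value", "method", "alternative")]
    ```

    Selecting by name rather than by position is a design choice here: it keeps the call readable and does not depend on the component order.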
    

    Alternatively, if you're on Linux (or Mac, but not tested), you could use parallel::mclapply, which works like lapply but with multiple cores, and use combn beforehand. This gives you the freedom to choose an arbitrary subset of combinations.

    ncomb <- as.data.frame(combn(names(mtcars), 2))
    
    parallel::mclapply(ncomb[, c(1:2, 11:12)], \(x) {
      data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x)) 
    }, mc.cores=7) |> do.call(what=rbind) |> `rownames<-`(NULL)
    #     estimate  statistic      p.value                          method alternative  X1   X2
    # 1 -0.9108013 10425.3320 4.690287e-13 Spearman's rank correlation rho   two.sided mpg  cyl
    # 2 -0.9088824 10414.8622 6.370336e-13 Spearman's rank correlation rho   two.sided mpg disp
    # 3  0.9276516   394.7330 2.275443e-14 Spearman's rank correlation rho   two.sided cyl disp
    # 4  0.9017909   535.8287 1.867686e-12 Spearman's rank correlation rho   two.sided cyl   hp
    

    On Windows you can use parallel::parLapply.

    library(parallel)
    
    CL <- makeCluster(detectCores() - 1)
    clusterExport(CL, c('ncomb', 'mtcars'))  ## `mtcars` stands in for your data
    
    parLapply(CL, ncomb[, c(1:2, 11:12)], \(x) {
      data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x)) 
    }) |> do.call(what=rbind) |> `rownames<-`(NULL)
    
    stopCluster(CL)
    

    See this answer for more details on the use of parLapply vs mclapply.
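
    Since the question mentions the future package, here is a minimal sketch of the "four futures" idea. It assumes that "a different fraction of the dataset" means disjoint subsets of rows and that the future package is installed. The helper `pairwise_spearman`, the object names `chunks`, `futs` and `res`, and the particular way of splitting the rows are all hypothetical and only for illustration.

    ```r
    library(future)
    plan(multisession, workers = 4)  # four background R sessions

    # Hypothetical helper: all pairwise Spearman tests for one data frame
    pairwise_spearman <- function(df) {
      pairs <- combn(names(df), 2, simplify = FALSE)
      rows <- lapply(pairs, function(p) {
        ct <- cor.test(df[[p[1]]], df[[p[2]]], method = "spearman", exact = FALSE)
        data.frame(Var1 = p[1], Var2 = p[2],
                   estimate  = unname(ct$estimate),
                   statistic = unname(ct$statistic),
                   p.value   = ct$p.value)
      })
      do.call(rbind, rows)
    }

    # Illustrative split: four blocks of consecutive rows (an assumption)
    chunks <- split(mtcars, rep(1:4, each = 8))

    # Launch one future per chunk, then collect the results
    futs <- lapply(chunks, function(d) future(pairwise_spearman(d)))
    res  <- lapply(futs, value)

    plan(sequential)  # shut the workers down again
    ```

    Each element of `res` is a data frame with one row per variable pair for that chunk. With small chunks some variables may be constant, in which case cor.test returns NA for that pair and emits a warning.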