rparallel-processingsnowparallel-foreach

Working with multiple cores and sparse matrices in R


I am working on a project that requires large matrices with a larger number of zeros. Unfortunately, as some of these matrices can have more than 1e10 elements, working with the "standard" R matrices is not an option, due to RAM constraints. Also, I need to work on multiple cores, as the computation can take quite a long time and really shouldn't.

So far, I have been working with the foreach package, and converted the results (which come in standard matrices) to sparse matrices afterwards. I can't help but think that there must be a smarter way.

Here is a minimal example of what I have been doing so far:

cl <- makeSOCKcluster(8)
registerDoSNOW(cl)

Mat <- foreach(j=1:length(lambda), .combine='cbind') %dopar% { 
  replicate(iter, rpois(n=1, lambda[j]))
}
Mat <- Matrix(Mat, sparse=TRUE)
stopCluster(cl)

The lambdas are all quite small, so that only every 5th element or so is different from zero, making it sensible to store the results in a sparse matrix.

Unfortunately, it has now become necessary to increase the number of iterations from 1e6 to at least 1e7, so that the matrix that is produced by the foreach loop is too large to be stored on 8GB of RAM. What I now want to do is split up the tasks into steps that each have 1e6 iterations, and combine these into a single, sparse matrix.

I now have the following as an idea:

library(Matrix)
library(snow)

cl <- makeSOCKcluster(8)

iter <- 1e6

steps <- 1e5
numsteps <- iter / steps

draws <- function(x, lambda, steps){
  replicate(n=steps, rpois(n=1, lambda=lambda))
}

for(i in 1:numsteps){
  Mat <- Matrix(0, nrow=steps, ncol=96, sparse=TRUE)

  Mat <- Matrix(
    parApply(cl=cl, X=Mat, MARGIN=2, FUN=draws, lambda=0.2, steps=steps)
    , sparse = TRUE)

  if(!exists("fullmat")) fullmat <- Mat else fullmat <- rBind(fullmat, Mat)

  rm(Mat)   
}

stopCluster(cl)

It works fine, but I had to fix lambda to some value. For my application, I need the values in the ith row to come from a poisson distribution with mean equal to the ith element of the lambda vector. This obviously worked fine in the foreach loop., but I have yet to find a way to make it work in an apply loop.

My questions are:

  1. Is it possible to have the apply function "know" on which row it is operating and pass a corresponding argument to a function?
  2. Is there a way to work with foreach and sparse matrices without the need of creating a standard matrix and converting it into a sparse one in the next step?
  3. If none of the above, is there a way for me to manually assign tasks to slave processes of R - that is, could I specifically tell a process to work on column 1, another to work on column 2 and so on, each creating a sparse vector and only combining these in the last step.

Solution

  • I was able to find a solution to my problem.

    In my case, I am able to define a unique ID for each of the columns, and can address the parameters by that. The following code should illustrate what I mean:

    library(snow)
    library(Matrix)
    
    iter <- 1e6
    steps <- 1e5
    
    # define a unique id
    SZid <- seq(from=1, to=10, by=1)
    
    # in order to have reproducible code, generate random parameters
    SZlambda <- replicate(runif(n=1, min=0, max=.5))
    SZmu <- replicate(runif(n=1, min=10, max=15))
    SZsigma <- replicate(runif(n=1, min=1, max=3))
    
    cl <- makeSOCKcluster(8)
    clusterExport(cl, list=c("SZlambda", "SZmu", "SZsigma"))
    
    numsteps <- iter / steps
    
    MCSZ <- function(SZid, steps){ # Monte Carlo Simulation 
      lambda <- SZlambda[SZid]; mu <- SZmu[SZid]; sigma <- SZsigma[SZid];
    
      replicate(steps, sum(rlnorm(meanlog=mu, sdlog=sigma, 
                                  n = rpois(n=1, lambda))
                           ))
    }
    
    for (i in 1:numsteps){
      Mat <- Matrix(
        parSapply(cl, X=SZid, FUN=MCSZ, steps=steps), sparse=TRUE)
    
      if(!exists("LossSZ")) LossSZ <- Mat else LossSZ <- rBind(LossSZ, Mat)
      rm(Mat)
    }
    
    stopCluster(cl)
    

    The trick is to apply the function not over the matrix, but over a vector of unique ids that line up with the indices of the parameters.