Tags: r, rstudio, glm, parallel

Parallelized code results in inflated memory usage in worker processes (RStudio defect)


Overview:

My B object is a big 100 000 * 5000 matrix of 2 GB.
My A object is smaller: 1000 * 5000.
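For reference, here is a minimal sketch of stand-in objects (the real data is not shown in the question, so everything below is a placeholder: dimensions are shrunk so the sketch runs quickly, and it assumes A and B share one row per sample so that glm can pair their columns; cov2 and cov3 are assumed to be per-sample covariates):

set.seed(1)
n_samples <- 200                                          # placeholder sample count
B    <- matrix(rnorm(n_samples * 50), nrow = n_samples)   # stands in for the big matrix
A    <- matrix(rnorm(n_samples * 10), nrow = n_samples)   # stands in for the smaller matrix
cov2 <- rnorm(n_samples)                                  # assumed per-sample covariate
cov3 <- rnorm(n_samples)                                  # assumed per-sample covariate
nb_cpu <- 2                                               # placeholder core count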

library(parallel)

# For each column X of B, fit a GLM of Y on X (optionally adjusted for a
# covariate) and keep the p-value of the X coefficient, i.e. row 2,
# column 4 of the coefficient table. B, cov2 and cov3 come from the parent
# environment and are inherited by the FORK workers.
analyse_with_glm <- function(Y) {
  cond1 = apply(B, 2, function(X) coef(summary(glm(Y ~ X)))[2, 4])
  cond2 = apply(B, 2, function(X) coef(summary(glm(Y ~ X + cov2)))[2, 4])
  cond3 = apply(B, 2, function(X) coef(summary(glm(Y ~ X + cov3)))[2, 4])
  list(cond1, cond2, cond3)
}

cl = makeCluster(nb_cpu, type = "FORK", outfile = 'outcluster.log')
res = parApply(cl, A, 2, analyse_with_glm)
stopCluster(cl)

Initially I have a single rsession process using 2.1 GB of my memory.
After calling parApply I have nb_cpu worker processes of 4.5 GB each.

Two questions: why do the worker processes use so much more memory than the parent session, and how can I avoid the out-of-memory crashes?

I use the 'top' command to monitor the processes and their memory usage, and this is not superficial usage that the garbage collector could release: the workers actually crash for being out of memory. The code runs on a 128 GB machine with 30 workers (nb_cpu = 30 in my code).

NB: I also tried the contrary, iterating over B (the big matrix) in parApply instead of A, but it did not fix the issue.
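As a cross-check against top, each worker's resident set size can be read from inside R. This is a sketch, not from the question: it assumes a POSIX system whose ps supports "-o rss= -p <pid>", and that the cluster cl is still running (run it before stopCluster):

# Ask each FORK worker to report its own resident set size (RSS, in kB)
# by querying ps with the worker's own PID.
rss_kb <- clusterEvalQ(cl, {
  as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE))
})
unlist(rss_kb)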


Solution

  • This answer might be partial, as I still consider R's behavior weird when it comes to parallelizing code. If you run code from RStudio, parallel workers tend to be inflated by the size of ~/.rstudio/suspended-session-data/.

    So to avoid it, here is a dummy workaround (a scripted rough equivalent is sketched after the list).
    1. Clean your environment
    2. Log out
    3. Log in
    4. Load your data
    5. Run your parallel code
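
    If logging out is not an option, a rough equivalent of step 1 (a sketch under assumptions: the object names are the ones from the code above, and analyse.R is a hypothetical script name) is:

    # Check how much suspended-session state RStudio is holding (shell):
    #   du -sh ~/.rstudio/suspended-session-data/
    # Keep only what the parallel job needs, then force a garbage collection
    # so the forked workers inherit as small a session as possible.
    rm(list = setdiff(ls(), c("A", "B", "cov2", "cov3", "nb_cpu", "analyse_with_glm")))
    gc()
    # Alternatively, run the whole pipeline from a plain terminal instead of
    # RStudio, so no suspended-session state exists to be inherited:
    #   Rscript analyse.R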

    INFO: