r macos memory parallel-processing mclapply

mclapply hangs when using multiple instances back to back


I'm trying to create a GloVe model with the data from the Kaggle Reddit comments challenge. I load the table, pull the body, and now I'm trying to clean the text.

I pulled a small subset (100000 titles) to experiment with, and this is what I have so far:

library(DBI)
require(RSQLite)
library(dplyr)
library(parallel)
library(progress)
library(textclean)

titles = as.character(df$body)
numcores = detectCores()

i = 1
temp = {}
out = {}
while(i <= 100000){
  temp = titles[i:(i+1000)] %>%
    mclapply(replace_emoji, mc.cores = numcores) %>%
    mclapply(replace_url, mc.cores = numcores) %>%
    mclapply(replace_contraction, mc.cores = numcores) %>%
    mclapply(gsub, pattern = "[^[:alnum:][:space:]]",replacement = "") %>% 
    mclapply(replace_number, mc.cores = numcores) 
  i = i+1000
  out = c(out, temp)
  print(i)
}

But it seems to get hung in random places. It doesn't throw an error, it just stops. When I look at Activity Monitor, I see the CPU usage drop and never recover.

I don't know what I would need to provide to make this problem easier to diagnose, so please let me know, and I'll edit it in.

Am I using mclapply wrong?

I'm using a Mac (i7, 16 GB RAM) with 8 cores.

Edit: I have looked around and found answers like this and this but they did not help me. Also, it seems to work if I just use lapply.


Solution

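  • The nested loops cause the problem: the sequential while loop calls mclapply five times on every pass, so worker processes are forked and collected hundreds of times over the full run. One iteration of a parallel loop should not have to wait for another iteration before it can proceed, and a deadlock can occur when a parallel loop is repeated sequentially like this.

    Parallel execution does not always improve efficiency.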
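
One way to avoid repeating the parallel loop is to chain all the cleaning steps inside a single function and call mclapply once over non-overlapping chunks. The following is a minimal sketch, not a confirmed fix for the hang: `apply_steps`, `clean_in_parallel`, and `strip_punct` are hypothetical helper names introduced here for illustration, and it assumes every cleaning step accepts and returns a character vector.

```r
library(parallel)

# Hypothetical helper: run every cleaning step on one chunk, in order.
apply_steps <- function(x, steps) {
  for (step in steps) x <- step(x)
  x
}

# Hypothetical helper: split `texts` into non-overlapping chunks and clean
# them with a single mclapply call, so workers are forked only once.
# Note: mc.cores > 1 relies on forking and is not supported on Windows.
clean_in_parallel <- function(texts, steps, chunk_size = 1000, cores = 2) {
  chunk_id <- ceiling(seq_along(texts) / chunk_size)
  chunks <- split(texts, chunk_id)
  cleaned <- mclapply(chunks, apply_steps, steps = steps, mc.cores = cores)
  unlist(cleaned, use.names = FALSE)
}

# The gsub step from the question, wrapped so it fits the same interface.
strip_punct <- function(x) gsub("[^[:alnum:][:space:]]", "", x)
```

With textclean loaded, the pipeline from the question would then be `steps <- list(replace_emoji, replace_url, replace_contraction, strip_punct, replace_number)` followed by `out <- clean_in_parallel(titles, steps, cores = detectCores())`. Splitting on `ceiling(seq_along(texts) / chunk_size)` also avoids the one-element overlap that `titles[i:(i+1000)]` produces between consecutive chunks.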