I'm trying to build a GloVe model with the data from the Kaggle Reddit comments challenge. I load the table, pull the body column, and now I'm trying to clean the text.
I pulled a small subset (100,000 titles) to experiment with, and this is what I have so far:
library(DBI)
require(RSQLite)
library(dplyr)
library(parallel)
library(progress)
library(textclean)
titles = as.character(df$body)
numcores = detectCores()
i = 1
out = list()  # collect cleaned chunks here; `{}` just evaluates to NULL
while (i <= 100000) {
  # note i:(i + 999) -- i:(i + 1000) takes 1001 elements, so consecutive
  # chunks would overlap by one title
  temp = titles[i:(i + 999)] %>%
    mclapply(replace_emoji, mc.cores = numcores) %>%
    mclapply(replace_url, mc.cores = numcores) %>%
    mclapply(replace_contraction, mc.cores = numcores) %>%
    mclapply(gsub, pattern = "[^[:alnum:][:space:]]",
             replacement = "", mc.cores = numcores) %>%
    mclapply(replace_number, mc.cores = numcores)
  i = i + 1000
  out = c(out, temp)
  print(i)
}
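For reference, the same cleaning pipeline can be written as a single composed function passed to one `mclapply` call per batch, so the workers are forked once instead of five times per chunk. This is only a sketch (it assumes the same `textclean` functions and `titles` vector as above) and is not itself a fix for the hang:

```r
library(parallel)
library(textclean)

# Apply all cleaning steps to one string, in the same order as the pipeline
clean_one = function(x) {
  x = replace_emoji(x)
  x = replace_url(x)
  x = replace_contraction(x)
  x = gsub("[^[:alnum:][:space:]]", "", x)
  replace_number(x)
}

# One fork per worker for the whole job, instead of one set of forks
# per cleaning step per chunk
out = mclapply(titles[1:100000], clean_one, mc.cores = detectCores())
```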
But it seems to get hung in random places. It doesn't throw an error; it just stops. When I look at Activity Monitor, I see the CPU usage drop and never recover.
I don't know what else I would need to provide to make this question easier to diagnose, so please let me know and I'll edit it in.
Am I using mclapply wrong?
I'm using a Mac with an i7 (8 cores) and 16 GB of RAM.
Edit: I have looked around and found answers like this and this, but they did not help. Also, the loop runs to completion if I just use lapply.
The nested loops caused the problem. One iteration of the parallel loop should not have to wait for another iteration in order to proceed; a deadlock occurs when the parallel loop ends up being run sequentially, with one iteration blocked on another.
Parallel work does not always produce good efficiency.
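To illustrate the efficiency point: when the per-element work is very cheap, the cost of forking workers and shipping results back can outweigh the computation, and `mclapply` can be no faster (or even slower) than plain `lapply`. A minimal sketch (timings are machine-dependent, so none are claimed here):

```r
library(parallel)

x = as.list(1:100000)

# Trivial per-element work: no forks, no serialization overhead
system.time(lapply(x, function(v) v + 1))

# Same work in parallel: fork and inter-process copying overhead
# can dominate when each task is this small
system.time(mclapply(x, function(v) v + 1, mc.cores = detectCores()))
```

The pipeline in the question is heavier per element than this, but the same trade-off applies to each of the repeated `mclapply` calls inside the loop.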