I have a huge Excel file to process. I need to read the colors and styles of a large number of cells, and I thought I could speed things up by parallelising the work. I'm relying on the xlsx package and its function getCellStyle to grab the style cell by cell. That package, in turn, relies on rJava. It looks like, for some reason, tasks involving Java objects cannot be parallelised. Here is a reproducible example:
require(xlsx)
require(writexl)
require(doParallel)
require(foreach)
require(parallel)
#We create an excel file with the iris dataset
filename <- "iris.xlsx"
write_xlsx(iris, filename)
#Read the workbook and the first (and only) sheet
wb <- loadWorkbook(filename)
sheet <- getSheets(wb)[[1]]
#With the next two rows we grab all the cells as Java objects
rows <- getRows(sheet)
allcells <- getCells(rows)
#This works: grabbing the style
styles <- lapply(allcells, getCellStyle)
styles[[1]]
#[1] "Java-Object{org.apache.poi.xssf.usermodel.XSSFCellStyle@abd07bb0}"
#Now we try to go parallel: we create a cluster and make
#use of foreach and dopar
registerDoParallel(6)
stylePar<-foreach(i = seq_along(allcells)) %dopar% getCellStyle(allcells[[i]])
#Unfortunately, every Java object looks null
stylePar[[1]]
#[1] "Java-Object<null>"
#For the record, even mclapply returns all Java null objects
#mclapply(allcells, getCellStyle, mc.cores = 6, mc.preschedule = FALSE)
Am I missing something, or is it inherently impossible to use foreach with Java objects? Note that I'm only reading values, not setting them.
Since the other answers do not point this out, it should be stated plainly: this cannot be done.
I found the same topic, from 8 years ago, on the R-devel mailing list:
https://stat.ethz.ch/pipermail/r-devel/2013-November/067960.html
"It’s a limitation of the Java runtime - you cannot fork a JVM."
Another source: the question "how to fork JVM?"
So this is a limitation of the JVM rather than of R.
The solution based on library(future.apply) only appears to work because no plan was activated, so it ran sequentially, just like base lapply. It should be invoked like this:
library(future)
library(future.apply)
plan(multisession)
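Note that activating the plan only changes where the work runs. rJava object references are pointers into the JVM of the session that created them, so they do not survive being sent to another R session. A minimal sketch of one way around this, assuming each worker opens the workbook itself and returns only plain R values: the helper `read_fill_colours`, the choice of 2 workers, and the choice of the fill colour as the extracted property are all illustrative assumptions, not part of the original question.

```r
library(xlsx)
library(writexl)
library(future)
library(future.apply)

# Same test file as in the question
filename <- "iris.xlsx"
write_xlsx(iris, filename)

plan(multisession, workers = 2)  # 2 workers is an arbitrary choice

# Hypothetical helper: runs entirely inside one worker session, so every
# Java object is created and consumed by that worker's own JVM.
read_fill_colours <- function(row_idx, path) {
  wb    <- xlsx::loadWorkbook(path)
  sheet <- xlsx::getSheets(wb)[[1]]
  cells <- xlsx::getCells(xlsx::getRows(sheet, rowIndex = row_idx))
  vapply(cells, function(cell) {
    colour <- xlsx::getCellStyle(cell)$getFillForegroundXSSFColor()
    if (rJava::is.jnull(colour)) return(NA_character_)  # no fill colour set
    hex <- colour$getARGBHex()
    if (rJava::is.jnull(hex)) NA_character_ else hex
  }, character(1))
}

# Header row plus one row per observation, split into contiguous chunks
n_rows     <- nrow(iris) + 1
row_chunks <- split(seq_len(n_rows), cut(seq_len(n_rows), 2, labels = FALSE))

# Only row indices and the file path cross the process boundary
fills <- future_lapply(row_chunks, read_fill_colours,
                       path = normalizePath(filename))
fills <- unlist(unname(fills))  # character vector named "row.col"

plan(sequential)  # shut the background workers down
```

Each worker pays the cost of loading the workbook again, so this is only likely to help when the per-cell work outweighs that load time.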
One last comment: parallel processing is not trivial and can work under different paradigms. Please check the future vignette at https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html, which covers synchronous and asynchronous processing, each with its own distinctions. The important thing to remember is that multisession (separate background R sessions) and multicore (forked R processes, not multithreading) work very differently.
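To make the difference concrete, here is a small sketch that needs no Java at all. It compares the process ID seen by each task under the default sequential plan and under multisession; the variable names and the choice of 2 workers are illustrative assumptions.

```r
library(future)
library(future.apply)

main_pid <- Sys.getpid()

# multisession: tasks run in separate background R sessions
plan(multisession, workers = 2)
session_pids <- unlist(future_lapply(1:4, function(i) Sys.getpid()))

# sequential: tasks run in the current R session, like base lapply
plan(sequential)
seq_pids <- unlist(future_lapply(1:4, function(i) Sys.getpid()))

print(all(seq_pids == main_pid))      # tasks stayed in the main process
print(all(session_pids != main_pid))  # tasks ran in other processes

# plan(multicore) would fork the current process instead (Unix only),
# which is exactly what a running JVM does not support.
```

Under multisession the workers start as fresh R sessions, so each one can initialise its own JVM, whereas a forked worker inherits a copy of the parent's already running JVM.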