mclapply encounters errors depending on core id?

I have a set of genes for which I need to calculate some coefficients in parallel. The coefficients are calculated inside GeneTo_GeneCoeffs_filtered, which takes a gene name as input and returns a list of two data frames.

With a gene_array of length 100, I ran the following command with different numbers of cores: 5, 6 and 7.

Coeffslist=mclapply(gene_array,GeneTo_GeneCoeffs_filtered,mc.cores = no_cores)

I encounter errors on different gene names depending on the number of cores assigned to mclapply.

The indices of the genes for which GeneTo_GeneCoeffs_filtered fails to return the list of data frames follow a pattern. With 7 cores assigned to mclapply, they are elements 4, 11, 18, 25, ..., 95 of gene_array (every 7th); with 6 cores, they are 2, 8, 14, ..., 98 (every 6th); and likewise with 5 cores, every 5th.

The most important thing is that the failing indices differ between these runs, which means the problem is not in particular genes.

I suspect there might be a "broken" core that cannot run my function properly and that it alone generates these errors. Is there a way to trace its id and exclude it from the list of cores that R can use?

Solution

A close reading of mclapply's manpage reveals that this behavior is by design and arises from the interaction between:

(a)

"the input X is split into as many parts as there are cores (currently the values are spread across the cores sequentially, i.e. first value to core 1, second to core 2, ... (core + 1)-th value to core 1 etc.) and then one process is forked to each core and the results are collected."

(b)

a "try-error" object will be returned for all the values involved in the failure, even if not all of them failed.

In your case, by virtue of (a), your gene_array is spread "round-robin" style across the cores (with a gap of mc.cores between the indices of successive elements), and by virtue of (b), if any gene_array element raises an error, you get back an error for every gene_array element sent to that core (again with a gap of mc.cores between the indices of those elements). So a single failing gene is enough to produce the pattern you see; no core is "broken".
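The interaction of (a) and (b) can be reproduced with a small sketch. The function f below is a hypothetical stand-in for GeneTo_GeneCoeffs_filtered that fails on exactly one input, and the example assumes a Unix-alike system where mclapply can fork:

```r
library(parallel)

# Hypothetical stand-in for GeneTo_GeneCoeffs_filtered:
# it fails for a single input (element 4) and works for all others.
f <- function(i) {
  if (i == 4) stop("simulated failure for element 4")
  i^2
}

# Default prescheduling: with 5 cores, elements 4, 9, 14, 19 share one worker.
# suppressWarnings() hides mclapply's warning about cores that hit errors.
res <- suppressWarnings(mclapply(1:20, f, mc.cores = 5))

# Indices whose result is a "try-error" object.
failed <- which(vapply(res, inherits, logical(1), what = "try-error"))
print(failed)
```

Although only element 4 actually fails, failed should contain 4, 9, 14 and 19, i.e. every element that was scheduled on the same worker.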

I refreshed my understanding of this in an exchange yesterday with Simon Urbanek: https://stat.ethz.ch/pipermail/r-sig-hpc/2019-September/002098.html in which I also provide an error-handling approach yielding errors only for the indices that generate an error.
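One common way to confine errors to the failing indices is to catch them inside the worker function, so that each element returns either its result or a condition object. The following is a minimal sketch of that idea and not necessarily the approach described in the linked thread; f and safe_f are hypothetical names, and forking (a Unix-alike system) is assumed:

```r
library(parallel)

# Hypothetical stand-in for GeneTo_GeneCoeffs_filtered.
f <- function(i) {
  if (i == 4) stop("simulated failure for element 4")
  i^2
}

# Wrapper that turns an error into a returned condition object,
# so the worker process itself never fails.
safe_f <- function(x) tryCatch(f(x), error = function(e) e)

res <- mclapply(1:20, safe_f, mc.cores = 5)

# Only elements that really failed hold an object of class "error".
failed <- which(vapply(res, inherits, logical(1), what = "error"))
print(failed)
print(conditionMessage(res[[4]]))
```

Here the other elements handled by the same worker (9, 14, 19) keep their computed values, and the message of each failure stays available through conditionMessage().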

You can also get errors only for the indices that actually generate an error by passing mc.preschedule=FALSE, because one job is then forked for each element of gene_array instead of one per core.
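A short sketch of the mc.preschedule=FALSE variant, again using a hypothetical function f in place of GeneTo_GeneCoeffs_filtered and assuming a system where forking is available:

```r
library(parallel)

# Hypothetical stand-in for GeneTo_GeneCoeffs_filtered.
f <- function(i) {
  if (i == 4) stop("simulated failure for element 4")
  i^2
}

# Without prescheduling, each element runs in its own forked job,
# so a failure affects only that element.
res <- suppressWarnings(
  mclapply(1:20, f, mc.cores = 5, mc.preschedule = FALSE)
)

failed <- which(vapply(res, inherits, logical(1), what = "try-error"))
print(failed)
```

The trade-off is that forking one job per element adds overhead, which may matter when there are many short-running calls.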