Using "sample" within mclapply in R not working properly


I'm trying to run multiple iterations of a function, using a different subset of my dataframe each time. In reality the function takes a very long time, so I want to split the iterations across multiple cores using mclapply. Each iteration uses sample to randomly select a subset of the dataframe, and this happens inside the function I pass to mclapply. However, the results in the output list are identical across iterations, which suggests that the sample lines are not being re-run each time. This must be something to do with how I have written the code. Any ideas where I have gone wrong?

Here is a reproducible example with a small dataset that runs quickly. Note that the 10 iterations in the d.val.all output list are identical, which is not what I am after.

library(bipartite)
library(doBy)
library(parallel)

# create dummy data
ecto.matrix1=data.frame(replicate(10,sample(0:80,81,replace=TRUE)),Species.mix.90=c(sample(c("R","M","S","B"),81,replace=TRUE)))

# set up the function
funct.resample.d <- function(i) {
  RedSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="R"]
  MountainSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="M"]
  randomSilverSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="S"],8,replace=FALSE)
  randomBlackSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="B"],8,replace=FALSE)
  resampledSites <- c(RedSites,MountainSites,randomSilverSites,randomBlackSites) # make vector of the site names
  matrix=ecto.matrix1[resampledSites,] # select only those rows from the resampled row names
  matrix1 = matrix[,colSums(matrix[,-c(ncol(matrix))]) > 0] # drop cols that sum to 0
  matrix2=summaryBy(matrix1[,-c(ncol(matrix1))]~Species.mix.90,data=matrix1,FUN=sum)
  for (col in 1:ncol(matrix2)){
    colnames(matrix2)[col] <-  sub(".sum", "", colnames(matrix2)[col]) # remove the sum bit from the col names
  }
  row.names(matrix2)<-matrix2$Species.mix.90 # make row names
  matrix2=subset(matrix2, select=-c(Species.mix.90)) # drop host col
  d.val <- dfun(matrix2)$dprime
}

# run mclapply
reps=c(1:10)
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10)

Solution

  • In case anyone else runs into a similar issue: the problem turned out to be the summaryBy function, not sample. After I replaced summaryBy with aggregate, the randomization worked fine. The matrix2 line in the function becomes:

    matrix2=aggregate(. ~ Species.mix.90, matrix1, sum)
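For anyone who wants to check the aggregate approach in isolation, here is a minimal sketch that uses only base R and parallel. The names dat, group and resample_sums are made up for illustration, the group and sample sizes are arbitrary, and the bipartite step (dfun) is left out, so treat it as a rough check rather than a drop-in replacement for the function above.

```r
library(parallel)

set.seed(1)  # only makes the dummy data repeatable
# hypothetical dummy data: 3 count columns plus a grouping column
dat <- data.frame(replicate(3, sample(0:80, 40, replace = TRUE)),
                  group = rep(c("R", "M", "S", "B"), each = 10))

# hypothetical helper: keep all R and M rows, resample 5 S and 5 B rows,
# then sum every count column within each group using aggregate
resample_sums <- function(i) {
  keep <- c(which(dat$group %in% c("R", "M")),
            sample(which(dat$group == "S"), 5),
            sample(which(dat$group == "B"), 5))
  aggregate(. ~ group, data = dat[keep, ], FUN = sum)
}

# mclapply relies on forking, so fall back to a single core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(1:4, resample_sums, mc.cores = n_cores)

# one value per iteration: the X1 total for the resampled "S" rows
sapply(res, function(x) x$X1[x$group == "S"])
```

If the resampling is happening inside each iteration, the "S" and "B" totals would be expected to vary between elements of res, while the "R" and "M" totals stay fixed because those rows are always kept in full.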