I'm trying to run multiple iterations of a function, using a different subset of my dataframe each time. In reality the function takes a very long time, so I want to split the iterations across multiple cores using mclapply. For each iteration I use sample to randomly select a subset of the dataframe, and this happens inside the function I have written to pass to mclapply. However, the results of all the iterations in the output list are identical, suggesting that mclapply is not re-running the sample lines each time. This must have something to do with how I have written the code; any ideas where I have gone wrong?
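One way to check whether sample itself is the culprit is a stripped-down test that does nothing but resample inside mclapply. The sketch below is only illustrative: pool, draw_sites and n_cores are hypothetical names that are not part of my real code, and the L'Ecuyer-CMRG generator is selected only so the parallel draws are reproducible from a seed.

```r
library(parallel)

# Assumption: a reproducible parallel RNG is wanted; "L'Ecuyer-CMRG" is the
# generator that the parallel package can split into per-process streams.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Hypothetical pool of site names standing in for the row names of the data
pool <- paste0("site", 1:20)

# Minimal worker: nothing but the resampling step
draw_sites <- function(i) sample(pool, 8, replace = FALSE)

# mclapply relies on forking, which is unavailable on Windows, so fall back to 1 core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

draws <- mclapply(1:10, draw_sites, mc.cores = n_cores)

# Count how many distinct subsets were drawn across the 10 iterations
n_distinct <- length(unique(lapply(draws, sort)))
n_distinct
```

If n_distinct is greater than 1, sample is being re-run in each iteration, and the identical results must come from a later step in the function.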
Here is a reproducible example using a small dataset that runs quickly. You will notice that the 10 iterations in the d.val.all output list are identical, which is not what I am after.
library(bipartite)
library(doBy)
library(parallel)
# create dummy data
ecto.matrix1=data.frame(replicate(10,sample(0:80,81,rep=TRUE)),Species.mix.90=c(sample(c("R","M","S","B"),81,rep=TRUE)))
# set up the function
funct.resample.d <- function(i) {
RedSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="R"]
MountainSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="M"]
randomSilverSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="S"],8,replace=F)
randomBlackSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="B"],8,replace=F)
resampledSites <- c(RedSites,MountainSites,randomSilverSites,randomBlackSites) # make vector of the site names
matrix=ecto.matrix1[resampledSites,] # select only those rows from the resampled row names
matrix1 = matrix[, c(colSums(matrix[,-c(ncol(matrix))]) > 0, TRUE)] # drop cols that sum to 0, always keep the host col
matrix2=summaryBy(matrix1[,-c(ncol(matrix1))]~Species.mix.90,data=matrix1,FUN=sum)
colnames(matrix2) <- sub("\\.sum$", "", colnames(matrix2)) # remove the literal ".sum" suffix from the col names
row.names(matrix2)<-matrix2$Species.mix.90 # make row names
matrix2=subset(matrix2, select=-c(Species.mix.90)) # drop host col
d.val <- dfun(matrix2)$dprime
}
# run mclapply
reps=c(1:10)
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10)
In case anyone else runs into a similar issue: I figured out that the problem was with the summaryBy function rather than sample. I replaced summaryBy with aggregate, and the randomization worked fine.
matrix2=aggregate(. ~ Species.mix.90, matrix1, sum)
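For anyone unsure what that aggregate call produces, here is a small sketch on a toy data frame. The names toy and summed and the values are made up for illustration; only the Species.mix.90 column name comes from the example above.

```r
# Hypothetical toy data: two count columns plus the host grouping column
toy <- data.frame(
  X1 = c(1, 2, 3, 4),
  X2 = c(0, 5, 0, 7),
  Species.mix.90 = c("R", "R", "M", "M")
)

# Sum every other column within each level of Species.mix.90
summed <- aggregate(. ~ Species.mix.90, data = toy, FUN = sum)

# Use the host as row names, then drop the host column
rownames(summed) <- summed$Species.mix.90
summed <- summed[, setdiff(names(summed), "Species.mix.90"), drop = FALSE]

summed
```

Note that aggregate keeps the original column names (X1, X2), so the step that strips the ".sum" suffix added by summaryBy is no longer needed.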