r  for-loop  optimization  resampling

Bootstrapping in R - each sample comprising multiple rows


With an example data frame pay, I am bootstrapping using base R. The main difference from classical bootstrapping is that the sampling unit is an ID, and each ID can span multiple rows, all of which must be included whenever that ID is drawn.

There are 4 distinct IDs in pay, so my goal is to draw a sample of 4 IDs with replacement and build a new dataset resample containing all rows of the sampled IDs.

My code currently works but is inefficient, given that my real data has one million rows and the bootstrap requires many repetitions.

Creating pay:

ID    <- c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4)
level <- 1:10
pay   <- data.frame(ID = ID, level = level)

My (inefficient) code for creating a single resampled dataset:

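IDs  <- levels(as.factor(ID))                      # distinct IDs, as character strings
samp <- sample(IDs, length(IDs), replace = TRUE)   # draw IDs with replacement
resample <- numeric(0)

for (i in seq_along(samp)) {
  temp <- pay[pay$ID == samp[i], ]    # all rows of the i-th sampled ID
  resample <- rbind(resample, temp)   # grows resample on every iteration
}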

Result:

> samp
[1] "1" "2" "3" "1"


> resample
  ID level
1  1     1
2  1     2
3  1     3
4  2     4
5  3     5
6  3     6
7  1     1
8  1     2
9  1     3

I think the slowest part is growing resample with rbind on every iteration, but I do not know in advance how many rows the result will have. Thanks a lot for your help.


Solution

  • You can sample the rows by doing

    pay[sample(seq_len(nrow(pay)), replace=TRUE),]
    

    It seems fairly efficient.

    > system.time({
    +   for (i in 1:10000)
    +     pay[sample(seq_len(nrow(pay)), replace=TRUE),]
    + })
       user  system elapsed
      0.469   0.002   0.473
    

    Edit:

    Per Dudelstein's comment below, the above is incorrect: it resamples individual rows, so the rows belonging to one ID are not kept together. Here's a way to address what I think you're asking for.

    # Draw IDs with replacement, then bind all rows of each drawn ID in one call
    samp <- sample(unique(pay$ID), replace=TRUE)
    do.call(rbind, lapply(samp, function(x) pay[pay$ID == x,]))
    

    In my benchmarks it was roughly a third faster than the original method. I'm sure there's a better way.
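
One possible refinement, offered as a sketch rather than a benchmarked solution: both versions above rescan pay$ID for every sampled ID and bind data frames together. Since the rows belonging to each ID never change between bootstrap repetitions, the row indices per ID can be computed once with split(), and each resample then needs only a single subsetting step. The names rows_by_id and resample_ids are hypothetical helpers introduced for illustration; they do not appear in the question.

```r
# Example data from the question
pay <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4), level = 1:10)

# Computed once: a list with one element per ID, holding that ID's row indices
rows_by_id <- split(seq_len(nrow(pay)), pay$ID)

# Hypothetical helper: draw one resample in which whole IDs are the sampling unit
resample_ids <- function(data, rows_by_id) {
  # Sample positions in the per-ID list, with replacement
  picked <- sample.int(length(rows_by_id), replace = TRUE)
  # Concatenate the row indices of every sampled ID, then subset once
  idx <- unlist(rows_by_id[picked], use.names = FALSE)
  data[idx, , drop = FALSE]
}

set.seed(1)  # only to make the example reproducible
resample <- resample_ids(pay, rows_by_id)

# Repeating the bootstrap reuses rows_by_id; here the statistic is the mean of level
boot_means <- replicate(200, mean(resample_ids(pay, rows_by_id)$level))
```

Because the number of rows in each resample is determined only after the IDs are drawn, nothing needs to be preallocated: unlist() builds the full index vector first and the data frame is created in one step.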