With an example dataframe pay, I am bootstrapping using base R. The main difference from classical bootstrapping is that a sampled ID can have multiple rows, all of which must be included.
There are 4 unique IDs in pay, so my goal is to draw a sample of 4 IDs with replacement and build a new dataset resample containing every row belonging to the sampled IDs.
My code currently works, but it is too slow: my real data has about one million rows and the bootstrap requires many repetitions.
Creating pay:
ID <- c(1,1,1,2,3,3,4,4,4,4)
level <- c(1:10)
pay <- data.frame(ID = ID, level = level)
My (inefficient) code for creating a single resampled dataset:
IDs <- levels(as.factor(ID))                      # unique IDs as character
samp <- sample(IDs, length(IDs), replace = TRUE)  # resample IDs with replacement
resample <- numeric(0)
for (i in seq_along(IDs)) {
  temp <- pay[pay$ID == samp[i], ]   # all rows belonging to the i-th sampled ID
  resample <- rbind(resample, temp)  # grow the result one block at a time
}
Result:
samp
[1] "1" "2" "3" "1"
resample
  ID level
1  1     1
2  1     2
3  1     3
4  2     4
5  3     5
6  3     6
7  1     1
8  1     2
9  1     3
I think the slowest part is extending resample with every iteration, but I do not know in advance how many rows there will be at the end, so I cannot simply preallocate. Thanks a lot for your help.
You can sample the rows by doing
pay[sample(seq_len(nrow(pay)), replace=TRUE),]
It seems fairly efficient.
> system.time({
+ for (i in 1:10000)
+ pay[sample(seq_len(nrow(pay)), replace=TRUE),]
+ })
user system elapsed
0.469 0.002 0.473
Edit:
Per Dudelstein's comment below, the above is incorrect: it resamples individual rows rather than whole IDs. Here's a way to address what I think you're asking for.
samp <- sample(unique(ID), replace = TRUE)                    # resample the IDs themselves
do.call(rbind, lapply(samp, function(x) pay[pay$ID == x, ]))  # bind all rows for each sampled ID in one call
Benchmarking suggests this is roughly a third faster than the original loop. I'm sure there's a better way.
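One possibly faster direction (a sketch, assuming pay and ID as defined above; not benchmarked here) is to split the row indices by ID once, so that each bootstrap draw is just a list lookup followed by a single subset instead of scanning pay$ID for every sampled ID:
idx <- split(seq_len(nrow(pay)), pay$ID)                 # row indices grouped by ID, computed once
samp <- sample(names(idx), replace = TRUE)               # resample IDs with replacement
resample <- pay[unlist(idx[samp], use.names = FALSE), ]  # one subset per bootstrap draw
Only the last two lines need to run inside the bootstrap loop, since idx does not change between draws.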