In a simulation, we need ordered data which is a random sample (with or without replacement) of size m from a full data set of size n. Unfortunately, ordering the sampled data turns out to be a performance bottleneck in our simulations. The problem is that the sampling is repeated R times, which results in a runtime complexity of O(R m log(m)). We are seeking to reduce the runtime complexity by calling sort()
only once, before all sampling:
n <- 500; m <- 100; R <- 1000
all.data <- runif(n)
all.data <- sort(all.data) # the full data set is already sorted
for (r in 1:R) {
indices <- sample(1:n, m)
sample.data <- sort(all.data[indices]) # this call to sort should be avoided
}
We therefore wonder whether it is possible to sort the full data set only once and then subsequently to directly obtain ordered samples by sampling from the ordered data (without ordering the indices returned by sample()
).
Does anyone have a suggestion how to do this in R (we do not necessarily need R code, but the general approach would also be sufficient)?
For sampling with replacement (see below for references and a simulation check):
frexp <- function(m, R, x) {
y <- rexp(R)
for (r in 1:R) {
cs <- cumsum(rexp(m))
sample.data <- x[ceiling(cs*length(x)/(cs[m] + y[r]))]
}
}
For sampling without replacement:
fbool2 <- function(m, R, x) {
n <- length(x)
b <- logical(n)
for (r in 1:R) {
b[i <- sample(n, m)] <- TRUE
sample.data <- x[b]
b[i] <- FALSE
}
}
Other functions for benchmarking, including one that doesn't bother to sort:
fNoSort <- function(m, R, x) for (r in 1:R) sample.data <- sample(x, m)
fsort <- function(m, R, x) for (r in 1:R) sample.data <- sort(sample(x, m))
fbool1 <- function(m, R, x) { # from @lotus
for (r in 1:R) sample.data <- x[sample(rep.int(!0:1, c(m, length(x) - m)))]
}
Benchmarking:
microbenchmark::microbenchmark(
fNoSort = fNoSort(m, R, all.data),
fsort = fsort(m, R, all.data),
fbool1 = fbool1(m, R, all.data),
fbool2 = fbool2(m, R, all.data),
frexp = frexp(m, R, all.data)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fNoSort 8.3386 8.52895 9.746159 8.80505 9.37925 19.4995 100
#> fsort 35.6090 38.05500 40.689749 39.67485 41.73925 79.2346 100
#> fbool1 29.2213 29.64240 31.838542 30.65475 32.70145 40.3151 100
#> fbool2 10.2221 10.43100 11.616368 10.79390 12.22930 19.7826 100
#> frexp 5.0672 5.16110 6.028450 5.30465 5.81935 16.2895 100
frexp
is faster even than sample
without the sort.
Quick simulation to check that the indices used in frexp
constitute a (sorted) uniform random sample with replacement:
(seed <- sample(.Machine$integer.max, 1))
#> [1] 904968452
set.seed(seed)
n <- 20 # sample the numbers 1 through 20
N <- 1e6 # number of samples
tabulate(sample(n, N, 1), n)
#> [1] 50176 50058 50270 50279 50208 49722 50005 49668 49856 50196 49890 49798
#> [13] 50163 49489 49918 49698 50052 50240 50333 49981
tabulate(ceiling((cs <- cumsum(rexp(N)))*n/(cs[N] + rexp(1))), n)
#> [1] 49918 49883 49957 50063 49815 50226 49792 49970 49879 49788 50119 49965
#> [13] 50417 49832 50059 50281 49930 49904 50416 49786
References: