Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R?
The code below works, but when the vector size grows it becomes slow for what should be obvious reasons. In this particular example a for loop might be faster but if the first NA value is very far from the start of the vector maybe not...
set.seed(1)
x <- c(rep(NA, 3), sample(c(T,F), size=5e7, replace=T))
min(which(!is.na(x))) #4
An Rcpp
for loop should be fast even if the first non-NA
value is not particularly near the start. The basic idea is that you do not need to keep iterating after you have found the first value.
Rcpp::cppFunction("
int which_non_na(LogicalVector x) {
R_xlen_t n = x.size();
for (R_xlen_t i = 0; i < n; ++i) {
if (!LogicalVector::is_na(x[i])) {
return i + 1; // +1 for 1-indexing in R
}
}
return NA_INTEGER; // if no non-NA values are found
}
")
which_non_na(x) # 4
Here are some benchmarks for some vectors of length 1e3
to 1e8
. For a vector of length 1,000 this is about 9 times faster than the R approach. For a vector of length 100,000,000, it is on average over a million times faster. I've added the other answers into the benchmark, too. which.min()
is faster than min(which())
but for the largest vector here this is still about 200,000 times faster than that.
bench::press(
vec_length = 10^(3:8),
{
# set probability of NA (so it won't necessarily be early)
prob_na <- min(0.9, log10(vec_length) / 10)
probs <- c(prob_na, (1 - prob_na) / 2, (1 - prob_na) / 2)
set.seed(1)
x <- sample(c(NA, TRUE, FALSE), size = vec_length, replace = TRUE, prob = probs)
bench::mark(
relative = TRUE,
base_r = min(which(!is.na(x))),
which_min = which.min(is.na(x)),
which_1 = which(!is.na(x))[1],
rcpp = which_non_na(x)
)
}
)
Output:
expression vec_length min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm>
1 base_r 1000 7.77 9.04 1 Inf Inf 9999 1 32.37ms
2 which_min 1000 1.76 2.14 4.75 Inf NaN 10000 0 6.82ms
3 which_1 1000 6.69 7.60 1.34 Inf NaN 10000 0 24.17ms
4 rcpp 1000 1 1 9.77 NaN NaN 10000 0 3.31ms
5 base_r 10000 71.2 121. 1 Inf Inf 9998 2 451.83ms
6 which_min 10000 10.3 10.1 11.1 Inf NaN 10000 0 40.78ms
7 which_1 10000 63.6 59.0 1.99 Inf Inf 9998 2 227.42ms
8 rcpp 10000 1 1 117. NaN NaN 10000 0 3.87ms
9 base_r 100000 1216. 1030. 1.03 Inf Inf 1288 2 493.42ms
10 which_min 100000 101. 99.5 7.46 Inf Inf 8903 4 471.91ms
11 which_1 100000 1159. 1002. 1 Inf Inf 1252 2 494.89ms
12 rcpp 100000 1 1 980. NaN NaN 10000 0 4.03ms
13 base_r 1000000 11177. 9625. 1 Inf Inf 132 3 483.53ms
14 which_min 1000000 962. 970. 9.26 Inf Inf 1226 6 485.01ms
15 which_1 1000000 10905. 9247. 1.06 Inf Inf 142 2 489.28ms
16 rcpp 1000000 1 1 8581. NaN NaN 10000 0 4.27ms
17 base_r 10000000 108934. 87149. 1.04 Inf Inf 12 3 408.38ms
18 which_min 10000000 13676. 17399. 5.78 Inf Inf 75 5 458.28ms
19 which_1 10000000 105301. 95828. 1 Inf Inf 13 2 459.48ms
20 rcpp 10000000 1 1 84229. NaN NaN 10000 0 4.2ms
21 base_r 100000000 1238681. 1224252. 1 Inf Inf 2 3 749.38ms
22 which_min 100000000 222405. 224765. 5.39 Inf Inf 8 4 555.68ms
23 which_1 100000000 1067243. 1053431. 1.16 Inf Inf 2 4 644.82ms
24 rcpp 100000000 1 1 1116208. NaN NaN 10000 0 3.36ms