Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R in the general case?
The code below works, but it does not short-circuit, so when the vector size grows it becomes slow for what should be obvious reasons. In this particular example a for loop might be faster but if the first NA value is very far from the start of the vector maybe not...
set.seed(1)
x <- c(rep(NA, 3), sample(c(T,F), size=5e7, replace=T))
min(which(!is.na(x))) #4
An Rcpp
for loop should be fast even if the first non-NA
value is not particularly near the start. The basic idea is that you do not need to keep iterating after you have found the first value.
Rcpp::cppFunction("int which_non_na(LogicalVector x) {
R_xlen_t n = x.size();
for (R_xlen_t i = 0; i < n; ++i) {
if (!LogicalVector::is_na(x[i])) {
return i + 1; // +1 for 1-indexing in R
}
}
return NA_INTEGER; // if no non-NA values are found
}")
which_non_na(x) # 4
Here are some benchmarks for some vectors of length 1e4
to 1e8
and between 1e3
and 1e6
non-NA
values. With a vector of length 10,000 and 1,000 non-NA values it is around 50 times faster than the base R approach in the question. For the shortest vectors, the approach is around the same speed (or even slightly slower than) the pure R nonNA()
function in the answer by Carl Witthoft. We see the benefits with the longest vectors and the most non-NA
values (so we can exit earlier).
With a vector of length 100 million and 1 million non-NA
values, the median time of the Rcpp
approach is around 600,000 times faster than the approach in the question. The pure R nonNA()
function falls further behind Rcpp
(although it's still decent and only 7 times as slow). The pure R min(which.min(x), which.max(x), Inf)
approach by ThomasIsCoding is pretty close with a relative speed of 1.86
compared with Rcpp
.
set.seed(1)
nonNA <- \(x) { for (jn in 1:length(x)) { if (!is.na(x[jn])) { return(jn) } }; NA }
results <- bench::press(
vec_length = 10^(4:8),
num_non_na = c(1e3, 1e5, 1e6),
{
# create vector with non-NA
if (num_non_na >= vec_length) num_non_na <- vec_length - 1
x <- sample(c(rep(c(TRUE, FALSE), length.out = num_non_na), rep(NA, vec_length - num_non_na)))
bench::mark(
relative = TRUE,
base_r = min(which(!is.na(x))),
which_min = min(which.min(x), which.max(x), Inf),
which_1 = which(!is.na(x))[1],
position = Position(\(x) !is.na(x), x),
non_na = nonNA(x),
rcpp = which_non_na(x)
)
}
)
Output:
expression vec_length num_non_na min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm>
1 base_r 10000 1000 54.2 57.8 1 Inf Inf 9978 22 193.91ms
2 which_min 10000 1000 1.63 1.66 36.6 NaN NaN 10000 0 5.31ms
3 which_1 10000 1000 53.0 55.9 1.16 Inf Inf 9977 23 167.24ms
4 position 10000 1000 2.86 3.11 20.0 NaN NaN 10000 0 9.71ms
5 non_na 10000 1000 1 1 51.3 Inf NaN 10000 0 3.79ms
6 rcpp 10000 1000 1.06 1.07 47.3 NaN NaN 10000 0 4.11ms
7 base_r 100000 1000 326. 353. 1.00 Inf Inf 3449 83 397.84ms
8 which_min 100000 1000 1.73 1.73 200. NaN NaN 10000 0 5.79ms
9 which_1 100000 1000 315. 358. 1 Inf Inf 3366 82 389.22ms
10 position 100000 1000 7.68 8.26 41.4 NaN Inf 9999 1 27.96ms
11 non_na 100000 1000 2.35 2.52 137. NaN NaN 10000 0 8.42ms
12 rcpp 100000 1000 1 1 345. NaN NaN 10000 0 3.35ms
13 base_r 1000000 1000 1257. 1514. 1.00 Inf Inf 223 82 302.37ms
14 which_min 1000000 1000 2.76 2.73 629. NaN NaN 10000 0 21.6ms
15 which_1 1000000 1000 1278. 1491. 1 Inf Inf 226 76 306.97ms
16 position 1000000 1000 634. 643. 2.68 NaN Inf 917 23 463.92ms
17 non_na 1000000 1000 134. 143. 12.0 NaN Inf 3787 41 429.6ms
18 rcpp 1000000 1000 1 1 1720. NaN NaN 10000 0 7.9ms
19 base_r 10000000 1000 10613. 11683. 1 Inf Inf 25 25 518.79ms
20 which_min 10000000 1000 3.23 3.23 3493. NaN NaN 10000 0 59.41ms
21 which_1 10000000 1000 10445. 10436. 1.10 Inf Inf 27 27 510.68ms
22 position 10000000 1000 882. 921. 11.1 NaN Inf 269 22 500.75ms
23 non_na 10000000 1000 189. 205. 49.8 NaN Inf 1201 42 500.07ms
24 rcpp 10000000 1000 1 1 11100. NaN NaN 10000 0 18.7ms
25 base_r 100000000 1000 39833. 39765. 1.01 Inf Inf 3 6 552.89ms
26 which_min 100000000 1000 10.3 10.3 3955. NaN NaN 10000 0 468.35ms
27 which_1 100000000 1000 40767. 40599. 1 Inf Inf 3 6 555.68ms
28 position 100000000 1000 1020. 1095. 34.2 NaN Inf 93 21 503.13ms
29 non_na 100000000 1000 211. 230. 164. NaN Inf 444 43 500.34ms
30 rcpp 100000000 1000 1 1 40816. NaN NaN 10000 0 45.38ms
31 base_r 10000 100000 57.8 74.9 1 Inf Inf 9998 2 295.12ms
32 which_min 10000 100000 1.53 1.54 60.2 NaN NaN 10000 0 4.9ms
33 which_1 10000 100000 47.3 50.9 1.93 Inf Inf 9998 2 153ms
34 position 10000 100000 2.85 2.97 32.3 NaN NaN 10000 0 9.13ms
35 non_na 10000 100000 1 1 92.9 NaN NaN 10000 0 3.18ms
36 rcpp 10000 100000 1.06 1.07 85.3 NaN NaN 10000 0 3.46ms
37 base_r 100000 100000 573. 593. 1 Inf Inf 2640 5 489.41ms
38 which_min 100000 100000 1.55 1.55 379. NaN NaN 10000 0 4.9ms
39 which_1 100000 100000 453. 479. 1.31 Inf Inf 3453 7 488.38ms
40 position 100000 100000 2.87 2.98 202. NaN NaN 10000 0 9.16ms
41 non_na 100000 100000 1 1 578. NaN NaN 10000 0 3.21ms
42 rcpp 100000 100000 1.06 1.06 578. NaN NaN 10000 0 3.21ms
43 base_r 1000000 100000 5830. 6048. 1 Inf Inf 244 5 482.78ms
44 which_min 1000000 100000 1.48 1.49 3903. NaN NaN 10000 0 5.07ms
45 which_1 1000000 100000 5717. 5937. 1.06 Inf Inf 261 4 487.61ms
46 position 1000000 100000 13.0 13.3 463. NaN Inf 9998 2 42.69ms
47 non_na 1000000 100000 3.48 3.65 1660. NaN NaN 10000 0 11.92ms
48 rcpp 1000000 100000 1 1 5784. NaN Inf 9999 1 3.42ms
49 base_r 10000000 100000 40802. 40182. 1 Inf Inf 24 7 367.53ms
50 which_min 10000000 100000 1.85 1.83 22655. NaN NaN 10000 0 6.76ms
51 which_1 10000000 100000 40334. 40123. 1.02 Inf Inf 23 10 344.07ms
52 position 10000000 100000 92.3 91.3 471. NaN Inf 9984 16 324.88ms
53 non_na 10000000 100000 20.3 21.1 2043. NaN Inf 9993 7 74.92ms
54 rcpp 10000000 100000 1 1 42212. NaN NaN 10000 0 3.63ms
55 base_r 100000000 100000 374396. 366226. 1.05 Inf Inf 3 6 548.73ms
56 which_min 100000000 100000 4.40 4.35 88633. NaN NaN 10000 0 21.62ms
57 which_1 100000000 100000 373680. 366947. 1 Inf Inf 3 6 574.97ms
58 position 100000000 100000 359. 364. 1029. NaN Inf 2685 23 500.11ms
59 non_na 100000000 100000 75.7 80.2 4240. NaN Inf 10000 38 452.07ms
60 rcpp 100000000 100000 1 1 387345. NaN NaN 10000 0 4.95ms
61 base_r 10000 1000000 59.1 63.5 1 Inf Inf 9998 2 210.55ms
62 which_min 10000 1000000 1.55 1.55 42.9 NaN NaN 10000 0 4.91ms
63 which_1 10000 1000000 45.6 50.6 1.42 Inf Inf 9998 2 148.62ms
64 position 10000 1000000 2.85 2.94 23.3 NaN NaN 10000 0 9.03ms
65 non_na 10000 1000000 1 1 66.8 NaN NaN 10000 0 3.15ms
66 rcpp 10000 1000000 1.06 1.06 64.7 NaN NaN 10000 0 3.25ms
67 base_r 100000 1000000 538. 562. 1 Inf Inf 2801 6 489.45ms
68 which_min 100000 1000000 1.46 1.55 247. NaN NaN 10000 0 7.08ms
69 which_1 100000 1000000 432. 456. 1.22 Inf Inf 3404 7 488.07ms
70 position 100000 1000000 2.71 2.91 185. NaN NaN 10000 0 9.44ms
71 non_na 100000 1000000 1 1 520. NaN NaN 10000 0 3.36ms
72 rcpp 100000 1000000 1.06 1.07 504. NaN NaN 10000 0 3.47ms
73 base_r 1000000 1000000 5571. 5799. 1 Inf Inf 260 6 481.33ms
74 which_min 1000000 1000000 1.54 1.52 3729. NaN NaN 10000 0 4.96ms
75 which_1 1000000 1000000 4557. 4637. 1.29 Inf Inf 336 7 481.23ms
76 position 1000000 1000000 2.84 2.89 2004. NaN NaN 10000 0 9.24ms
77 non_na 1000000 1000000 1 1 5494. NaN NaN 10000 0 3.37ms
78 rcpp 1000000 1000000 1.06 1.02 5732. NaN NaN 10000 0 3.23ms
79 base_r 10000000 1000000 69156. 69066. 1 Inf Inf 17 5 377.72ms
80 which_min 10000000 1000000 1.46 1.47 43589. NaN NaN 10000 0 5.1ms
81 which_1 10000000 1000000 67884. 66517. 1.05 Inf Inf 19 5 400.27ms
82 position 10000000 1000000 4.03 4.16 15632. NaN NaN 10000 0 14.21ms
83 non_na 10000000 1000000 1.32 1.31 44276. NaN NaN 10000 0 5.02ms
84 rcpp 10000000 1000000 1 1 66408. NaN Inf 9999 1 3.35ms
85 base_r 100000000 1000000 662166. 619018. 1.04 Inf Inf 3 4 635.95ms
86 which_min 100000000 1000000 1.90 1.86 326808. NaN NaN 10000 0 6.72ms
87 which_1 100000000 1000000 589090. 557688. 1 Inf Inf 3 3 659.17ms
88 position 100000000 1000000 31.4 31.0 19264. NaN Inf 10000 4 114.06ms
89 non_na 100000000 1000000 7.14 7.59 74081. NaN Inf 10000 2 29.66ms
90 rcpp 100000000 1000000 1 1 62977. NaN Inf 10000 1 34.89ms
Plot of results (log time scale):