rperformancebigdatana

Is there a faster way to find the first value that is not NA in a large vector using base R?


Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R in the general case?

The code below works, but it does not short-circuit, so when the vector size grows it becomes slow for what should be obvious reasons. In this particular example a for loop might be faster but if the first NA value is very far from the start of the vector maybe not...

set.seed(1)
x <- c(rep(NA, 3), sample(c(T,F), size=5e7, replace=T))
min(which(!is.na(x))) #4

Solution

  • An Rcpp for loop should be fast even if the first non-NA value is not particularly near the start. The basic idea is that you do not need to keep iterating after you have found the first value.

    Rcpp::cppFunction("int which_non_na(LogicalVector x) {
        R_xlen_t n = x.size();
        for (R_xlen_t i = 0; i < n; ++i) {
            if (!LogicalVector::is_na(x[i])) {
                return i + 1; // +1 for 1-indexing in R
            }
        }
        return NA_INTEGER; // if no non-NA values are found
    }")
    
    which_non_na(x) # 4
    

    Benchmarks

    Here are some benchmarks for some vectors of length 1e4 to 1e8 and between 1e3 and 1e6 non-NA values. With a vector of length 10,000 and 1,000 non-NA values it is around 50 times faster than the base R approach in the question. For the shortest vectors, the approach is around the same speed (or even slightly slower than) the pure R nonNA() function in the answer by Carl Witthoft. We see the benefits with the longest vectors and the most non-NA values (so we can exit earlier).

    With a vector of length 100 million and 1 million non-NA values, the median time of the Rcpp approach is around 600,000 times faster than the approach in the question. The pure R nonNA() function falls further behind Rcpp (although it's still decent and only 7 times as slow). The pure R min(which.min(x), which.max(x), Inf) approach by ThomasIsCoding is pretty close with a relative speed of 1.86 compared with Rcpp.

    set.seed(1)
    nonNA <- \(x) { for (jn in 1:length(x)) { if (!is.na(x[jn])) { return(jn) } }; NA }
    results <- bench::press(
        vec_length = 10^(4:8),
        num_non_na = c(1e3, 1e5, 1e6),
        {
            # create vector with non-NA
            if (num_non_na >= vec_length) num_non_na <- vec_length - 1
            x <- sample(c(rep(c(TRUE, FALSE), length.out = num_non_na), rep(NA, vec_length - num_non_na)))
    
            bench::mark(
                relative = TRUE,
                base_r = min(which(!is.na(x))),
                which_min = min(which.min(x), which.max(x), Inf),
                which_1 = which(!is.na(x))[1],
                position = Position(\(x) !is.na(x), x),
                non_na = nonNA(x),
                rcpp = which_non_na(x)
            )
        }
    )
    

    Output:

       expression vec_length num_non_na       min    median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
       <bch:expr>      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm>
     1 base_r          10000       1000     54.2      57.8       1          Inf      Inf  9978    22   193.91ms
     2 which_min       10000       1000      1.63      1.66     36.6        NaN      NaN 10000     0     5.31ms
     3 which_1         10000       1000     53.0      55.9       1.16       Inf      Inf  9977    23   167.24ms
     4 position        10000       1000      2.86      3.11     20.0        NaN      NaN 10000     0     9.71ms
     5 non_na          10000       1000      1         1        51.3        Inf      NaN 10000     0     3.79ms
     6 rcpp            10000       1000      1.06      1.07     47.3        NaN      NaN 10000     0     4.11ms
     7 base_r         100000       1000    326.      353.        1.00       Inf      Inf  3449    83   397.84ms
     8 which_min      100000       1000      1.73      1.73    200.         NaN      NaN 10000     0     5.79ms
     9 which_1        100000       1000    315.      358.        1          Inf      Inf  3366    82   389.22ms
    10 position       100000       1000      7.68      8.26     41.4        NaN      Inf  9999     1    27.96ms
    11 non_na         100000       1000      2.35      2.52    137.         NaN      NaN 10000     0     8.42ms
    12 rcpp           100000       1000      1         1       345.         NaN      NaN 10000     0     3.35ms
    13 base_r        1000000       1000   1257.     1514.        1.00       Inf      Inf   223    82   302.37ms
    14 which_min     1000000       1000      2.76      2.73    629.         NaN      NaN 10000     0     21.6ms
    15 which_1       1000000       1000   1278.     1491.        1          Inf      Inf   226    76   306.97ms
    16 position      1000000       1000    634.      643.        2.68       NaN      Inf   917    23   463.92ms
    17 non_na        1000000       1000    134.      143.       12.0        NaN      Inf  3787    41    429.6ms
    18 rcpp          1000000       1000      1         1      1720.         NaN      NaN 10000     0      7.9ms
    19 base_r       10000000       1000  10613.    11683.        1          Inf      Inf    25    25   518.79ms
    20 which_min    10000000       1000      3.23      3.23   3493.         NaN      NaN 10000     0    59.41ms
    21 which_1      10000000       1000  10445.    10436.        1.10       Inf      Inf    27    27   510.68ms
    22 position     10000000       1000    882.      921.       11.1        NaN      Inf   269    22   500.75ms
    23 non_na       10000000       1000    189.      205.       49.8        NaN      Inf  1201    42   500.07ms
    24 rcpp         10000000       1000      1         1     11100.         NaN      NaN 10000     0     18.7ms
    25 base_r      100000000       1000  39833.    39765.        1.01       Inf      Inf     3     6   552.89ms
    26 which_min   100000000       1000     10.3      10.3    3955.         NaN      NaN 10000     0   468.35ms
    27 which_1     100000000       1000  40767.    40599.        1          Inf      Inf     3     6   555.68ms
    28 position    100000000       1000   1020.     1095.       34.2        NaN      Inf    93    21   503.13ms
    29 non_na      100000000       1000    211.      230.      164.         NaN      Inf   444    43   500.34ms
    30 rcpp        100000000       1000      1         1     40816.         NaN      NaN 10000     0    45.38ms
    31 base_r          10000     100000     57.8      74.9       1          Inf      Inf  9998     2   295.12ms
    32 which_min       10000     100000      1.53      1.54     60.2        NaN      NaN 10000     0      4.9ms
    33 which_1         10000     100000     47.3      50.9       1.93       Inf      Inf  9998     2      153ms
    34 position        10000     100000      2.85      2.97     32.3        NaN      NaN 10000     0     9.13ms
    35 non_na          10000     100000      1         1        92.9        NaN      NaN 10000     0     3.18ms
    36 rcpp            10000     100000      1.06      1.07     85.3        NaN      NaN 10000     0     3.46ms
    37 base_r         100000     100000    573.      593.        1          Inf      Inf  2640     5   489.41ms
    38 which_min      100000     100000      1.55      1.55    379.         NaN      NaN 10000     0      4.9ms
    39 which_1        100000     100000    453.      479.        1.31       Inf      Inf  3453     7   488.38ms
    40 position       100000     100000      2.87      2.98    202.         NaN      NaN 10000     0     9.16ms
    41 non_na         100000     100000      1         1       578.         NaN      NaN 10000     0     3.21ms
    42 rcpp           100000     100000      1.06      1.06    578.         NaN      NaN 10000     0     3.21ms
    43 base_r        1000000     100000   5830.     6048.        1          Inf      Inf   244     5   482.78ms
    44 which_min     1000000     100000      1.48      1.49   3903.         NaN      NaN 10000     0     5.07ms
    45 which_1       1000000     100000   5717.     5937.        1.06       Inf      Inf   261     4   487.61ms
    46 position      1000000     100000     13.0      13.3     463.         NaN      Inf  9998     2    42.69ms
    47 non_na        1000000     100000      3.48      3.65   1660.         NaN      NaN 10000     0    11.92ms
    48 rcpp          1000000     100000      1         1      5784.         NaN      Inf  9999     1     3.42ms
    49 base_r       10000000     100000  40802.    40182.        1          Inf      Inf    24     7   367.53ms
    50 which_min    10000000     100000      1.85      1.83  22655.         NaN      NaN 10000     0     6.76ms
    51 which_1      10000000     100000  40334.    40123.        1.02       Inf      Inf    23    10   344.07ms
    52 position     10000000     100000     92.3      91.3     471.         NaN      Inf  9984    16   324.88ms
    53 non_na       10000000     100000     20.3      21.1    2043.         NaN      Inf  9993     7    74.92ms
    54 rcpp         10000000     100000      1         1     42212.         NaN      NaN 10000     0     3.63ms
    55 base_r      100000000     100000 374396.   366226.        1.05       Inf      Inf     3     6   548.73ms
    56 which_min   100000000     100000      4.40      4.35  88633.         NaN      NaN 10000     0    21.62ms
    57 which_1     100000000     100000 373680.   366947.        1          Inf      Inf     3     6   574.97ms
    58 position    100000000     100000    359.      364.     1029.         NaN      Inf  2685    23   500.11ms
    59 non_na      100000000     100000     75.7      80.2    4240.         NaN      Inf 10000    38   452.07ms
    60 rcpp        100000000     100000      1         1    387345.         NaN      NaN 10000     0     4.95ms
    61 base_r          10000    1000000     59.1      63.5       1          Inf      Inf  9998     2   210.55ms
    62 which_min       10000    1000000      1.55      1.55     42.9        NaN      NaN 10000     0     4.91ms
    63 which_1         10000    1000000     45.6      50.6       1.42       Inf      Inf  9998     2   148.62ms
    64 position        10000    1000000      2.85      2.94     23.3        NaN      NaN 10000     0     9.03ms
    65 non_na          10000    1000000      1         1        66.8        NaN      NaN 10000     0     3.15ms
    66 rcpp            10000    1000000      1.06      1.06     64.7        NaN      NaN 10000     0     3.25ms
    67 base_r         100000    1000000    538.      562.        1          Inf      Inf  2801     6   489.45ms
    68 which_min      100000    1000000      1.46      1.55    247.         NaN      NaN 10000     0     7.08ms
    69 which_1        100000    1000000    432.      456.        1.22       Inf      Inf  3404     7   488.07ms
    70 position       100000    1000000      2.71      2.91    185.         NaN      NaN 10000     0     9.44ms
    71 non_na         100000    1000000      1         1       520.         NaN      NaN 10000     0     3.36ms
    72 rcpp           100000    1000000      1.06      1.07    504.         NaN      NaN 10000     0     3.47ms
    73 base_r        1000000    1000000   5571.     5799.        1          Inf      Inf   260     6   481.33ms
    74 which_min     1000000    1000000      1.54      1.52   3729.         NaN      NaN 10000     0     4.96ms
    75 which_1       1000000    1000000   4557.     4637.        1.29       Inf      Inf   336     7   481.23ms
    76 position      1000000    1000000      2.84      2.89   2004.         NaN      NaN 10000     0     9.24ms
    77 non_na        1000000    1000000      1         1      5494.         NaN      NaN 10000     0     3.37ms
    78 rcpp          1000000    1000000      1.06      1.02   5732.         NaN      NaN 10000     0     3.23ms
    79 base_r       10000000    1000000  69156.    69066.        1          Inf      Inf    17     5   377.72ms
    80 which_min    10000000    1000000      1.46      1.47  43589.         NaN      NaN 10000     0      5.1ms
    81 which_1      10000000    1000000  67884.    66517.        1.05       Inf      Inf    19     5   400.27ms
    82 position     10000000    1000000      4.03      4.16  15632.         NaN      NaN 10000     0    14.21ms
    83 non_na       10000000    1000000      1.32      1.31  44276.         NaN      NaN 10000     0     5.02ms
    84 rcpp         10000000    1000000      1         1     66408.         NaN      Inf  9999     1     3.35ms
    85 base_r      100000000    1000000 662166.   619018.        1.04       Inf      Inf     3     4   635.95ms
    86 which_min   100000000    1000000      1.90      1.86 326808.         NaN      NaN 10000     0     6.72ms
    87 which_1     100000000    1000000 589090.   557688.        1          Inf      Inf     3     3   659.17ms
    88 position    100000000    1000000     31.4      31.0   19264.         NaN      Inf 10000     4   114.06ms
    89 non_na      100000000    1000000      7.14      7.59  74081.         NaN      Inf 10000     2    29.66ms
    90 rcpp        100000000    1000000      1         1     62977.         NaN      Inf 10000     1    34.89ms
    

    Plot of results (log time scale):

    enter image description here