rperformancebigdatana

Is there a faster way to find the first value that is not NA in a large vector using base R?


Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R?

The code below works, but when the vector size grows it becomes slow for what should be obvious reasons. In this particular example a for loop might be faster but if the first NA value is very far from the start of the vector maybe not...

set.seed(1)
x <- c(rep(NA, 3), sample(c(T,F), size=5e7, replace=T))
min(which(!is.na(x))) #4

Solution

  • An Rcpp for loop should be fast even if the first non-NA value is not particularly near the start. The basic idea is that you do not need to keep iterating after you have found the first value.

    Rcpp::cppFunction("
    int which_non_na(LogicalVector x) {
        R_xlen_t n = x.size();
        for (R_xlen_t i = 0; i < n; ++i) {
            if (!LogicalVector::is_na(x[i])) {
                return i + 1; // +1 for 1-indexing in R
            }
        }
        return NA_INTEGER; // if no non-NA values are found
    }
    ")
    
    which_non_na(x) # 4
    

    Benchmarks

    Here are some benchmarks for some vectors of length 1e3 to 1e8. For a vector of length 1,000 this is about 9 times faster than the R approach. For a vector of length 100,000,000, it is on average over a million times faster. I've added the other answers into the benchmark, too. which.min() is faster than min(which()) but for the largest vector here this is still about 200,000 times faster than that.

    bench::press(
        vec_length = 10^(3:8),
        {
            # set probability of NA (so it won't necessarily be early)
            prob_na <- min(0.9, log10(vec_length) / 10)
            probs <- c(prob_na, (1 - prob_na) / 2, (1 - prob_na) / 2)
    
            set.seed(1)
            x <- sample(c(NA, TRUE, FALSE), size = vec_length, replace = TRUE, prob = probs)
    
            bench::mark(
                relative = TRUE,
                base_r = min(which(!is.na(x))),
                which_min = which.min(is.na(x)),
                which_1 = which(!is.na(x))[1],
                rcpp = which_non_na(x)
            )
        }
    )
    

    Output:

       expression vec_length        min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
       <bch:expr>      <dbl>      <dbl>      <dbl>      <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm>
     1 base_r           1000       7.77       9.04       1          Inf      Inf  9999     1    32.37ms
     2 which_min        1000       1.76       2.14       4.75       Inf      NaN 10000     0     6.82ms
     3 which_1          1000       6.69       7.60       1.34       Inf      NaN 10000     0    24.17ms
     4 rcpp             1000       1          1          9.77       NaN      NaN 10000     0     3.31ms
     5 base_r          10000      71.2      121.         1          Inf      Inf  9998     2   451.83ms
     6 which_min       10000      10.3       10.1       11.1        Inf      NaN 10000     0    40.78ms
     7 which_1         10000      63.6       59.0        1.99       Inf      Inf  9998     2   227.42ms
     8 rcpp            10000       1          1        117.         NaN      NaN 10000     0     3.87ms
     9 base_r         100000    1216.      1030.         1.03       Inf      Inf  1288     2   493.42ms
    10 which_min      100000     101.        99.5        7.46       Inf      Inf  8903     4   471.91ms
    11 which_1        100000    1159.      1002.         1          Inf      Inf  1252     2   494.89ms
    12 rcpp           100000       1          1        980.         NaN      NaN 10000     0     4.03ms
    13 base_r        1000000   11177.      9625.         1          Inf      Inf   132     3   483.53ms
    14 which_min     1000000     962.       970.         9.26       Inf      Inf  1226     6   485.01ms
    15 which_1       1000000   10905.      9247.         1.06       Inf      Inf   142     2   489.28ms
    16 rcpp          1000000       1          1       8581.         NaN      NaN 10000     0     4.27ms
    17 base_r       10000000  108934.     87149.         1.04       Inf      Inf    12     3   408.38ms
    18 which_min    10000000   13676.     17399.         5.78       Inf      Inf    75     5   458.28ms
    19 which_1      10000000  105301.     95828.         1          Inf      Inf    13     2   459.48ms
    20 rcpp         10000000       1          1      84229.         NaN      NaN 10000     0      4.2ms
    21 base_r      100000000 1238681.   1224252.         1          Inf      Inf     2     3   749.38ms
    22 which_min   100000000  222405.    224765.         5.39       Inf      Inf     8     4   555.68ms
    23 which_1     100000000 1067243.   1053431.         1.16       Inf      Inf     2     4   644.82ms
    24 rcpp        100000000       1          1    1116208.         NaN      NaN 10000     0     3.36ms