
na.omit is not removing NAs


I am trying to remove NAs in R. I have tried to replicate a simple example that I found in multiple places online, but I am getting unexpected output. I cannot find the error by searching online. What am I doing wrong?
I am using R version 4.3.2. I have restarted R and cleared the global variables (and restarted R again) and consistently get this result with anything I try.

a <- c(1,2,NA,3,4,NA,5,6)
b <- na.omit(a)
b

The output is

[1] 1 2 3 4 5 6
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "omit"

I was expecting to get the output 1 2 3 4 5 6

I have found that I can instead use b <- a[!(is.na(a))], but I am curious why the commonly suggested na.omit does not work.


Solution

  • You do get the intended values in the output. What I think you are missing is that attr(,"na.action") and attr(,"class") are simply attributes attached to the numeric vector, which holds the six non-NA numbers. Arithmetic still works as usual: if you do b + 1, you'll get the values incremented:

    b + 1
    # [1] 2 3 4 5 6 7
    # attr(,"na.action")
    # [1] 3 6
    # attr(,"class")
    # [1] "omit"
    

    If you really want to use na.omit and remove the attributes, you can do:

    attributes(b) <- NULL
    b
    # [1] 1 2 3 4 5 6
    

    Ultimately, though, a[!is.na(a)] is much faster and should be just as safe. Compare the `itr/sec` column: a[!is.na(a)] is ~10x faster on this small vector.

    bench::mark(
      isna         = a[!is.na(a)],
      omit         = na.omit(a),
      omit_no_attr = `attributes<-`(na.omit(a), NULL),
      check = FALSE)  # results differ in attributes, so skip the equality check
    # # A tibble: 3 × 13
    #   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time                gc                   
    #   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list>              <list>               
    # 1 isna         311.88ns 325.96ns  2319673.        NA      0   10000     0     4.31ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
    # 2 omit            2.8µs   3.29µs   236026.        NA     53.8  4389     1    18.59ms <NULL> <NULL> <bench_tm [4,390]>  <tibble [4,390 × 3]> 
    # 3 omit_no_attr   2.91µs   3.38µs   286354.        NA      0   10000     0    34.92ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
    

    Even on a medium-large vector (8,000 elements), it's still faster, though only by about 1.8x:

    a_medium <- rep(a, 1000)
    bench::mark(
      isna         = a_medium[!is.na(a_medium)],
      omit         = na.omit(a_medium),
      omit_no_attr = `attributes<-`(na.omit(a_medium), NULL),
      check = FALSE)
    # # A tibble: 3 × 13
    #   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time                gc                   
    #   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list>              <list>               
    # 1 isna           16.2µs   18.3µs    53627.        NA     5.36  9999     1      186ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
    # 2 omit           29.4µs   33.4µs    29944.        NA     0    10000     0      334ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
    # 3 omit_no_attr   29.5µs   33.7µs    29215.        NA     2.92  9999     1      342ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
    

    But if it gets a lot larger, we start seeing some parity:

    a_big <- rep(a, 100000)
    bench::mark(
      isna         = a_big[!is.na(a_big)],
      omit         = na.omit(a_big),
      omit_no_attr = `attributes<-`(na.omit(a_big), NULL),
      check = FALSE)
    # # A tibble: 3 × 13
    #   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time             gc                
    #   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list>           <list>            
    # 1 isna           2.03ms   2.19ms      452.        NA     2.10   215     1      475ms <NULL> <NULL> <bench_tm [216]> <tibble [216 × 3]>
    # 2 omit           3.08ms    3.3ms      259.        NA     2.05   126     1      487ms <NULL> <NULL> <bench_tm [127]> <tibble [127 × 3]>
    # 3 omit_no_attr    3.1ms   3.27ms      302.        NA     2.05   147     1      487ms <NULL> <NULL> <bench_tm [148]> <tibble [148 × 3]>
    

    But since we're talking on the order of 2-3 ms for a vector of length 800,000, the juice might not be worth the squeeze.
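
To make the point about attributes concrete, here is a minimal sketch in base R that inspects what na.omit attaches and shows one way to get a plain vector back without modifying b. The names `dropped` and `b_plain` are just illustrative, and as.vector() is swapped in here as an alternative to the attributes(b) <- NULL approach above, not something the original answer used.

```r
# Same input as in the question.
a <- c(1, 2, NA, 3, 4, NA, 5, 6)
b <- na.omit(a)

# The data itself is still an ordinary numeric vector of the non-NA values.
length(b)
is.numeric(b)

# The extra printed lines are attributes; "na.action" records the
# positions of the elements that were dropped.
dropped <- as.integer(attr(b, "na.action"))
dropped

# as.vector() returns a copy of an atomic vector with attributes removed,
# leaving b itself untouched.
b_plain <- as.vector(b)
is.null(attributes(b_plain))

# The stripped result is identical to the logical-subsetting approach.
identical(b_plain, a[!is.na(a)])
```

If the positions of the removed values are never needed, a[!is.na(a)] gives the same plain vector in one step.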