Tags: r, dataframe, na, grepl

Where to place is.na() when using grepl to find non-missing non-matches?


I want to find the non-missing entries of var2 that do not contain a given expression, and retrieve the corresponding values of var1. Negating grepl also selects the missing entries (grepl returns FALSE for NA), which I want to avoid. I came up with two approaches, and one of them delivers wrong results. I would like to understand why. The code and its output are below. Thank you!

df <- data.frame(
  var1 = 1:150,
  var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
)  

exp <- "BC"

## Correct results with
df$var1[!grepl(exp, df$var2, fixed=T) & !is.na(df$var2)]
# [1] 146 147 148 149 150


## Incorrect results with
df$var1[!grepl(exp, df$var2[!is.na(df$var2)], fixed=T)]
# [1]  46  47  48  49  50  96  97  98  99 100 146 147 148 149 150

## var2 values at the positions returned above
print(df[c(46, 47, 48, 49,  50,  96,  97,  98,  99, 100, 146, 147, 148, 149, 150),2])
# [1] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    "CBA" "CBA" "CBA" "CBA" "CBA"

Solution

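  • Your issue is due to R's "recycling": a logical index that is shorter than the vector it subsets is silently repeated until it reaches that vector's length. As usual, it's much easier to see with a smaller example, so let's use 15 rows instead of 150 and run each component separately to see what's going on: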

    ## nice small sample
    df <- data.frame(
      var1 = 1:15,
      var2 = c(rep(NA, 10), rep("ABC", 4), rep("CBA", 1))
    )  
    
    ## here's the non-missing var2 values, there are 5 of them
    df$var2[!is.na(df$var2)]
    # [1] "ABC" "ABC" "ABC" "ABC" "CBA"
    
    ## and when we grep them, we get 5 TRUE/FALSE values
    (!grepl(exp, df$var2[!is.na(df$var2)], fixed=T))
    # [1] FALSE FALSE FALSE FALSE  TRUE
    
    ## but how long is var1? 15, not 5
    df$var1
    # [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    
    ## what happens when we index a length-15 vector by fewer than
    ## 15 TRUE/FALSE values? The shorter vector is "recycled", that
    ## is, repeated until it reaches the length of the longer vector.
    ## Here are a couple of examples of that with a length-3 index
    df$var1[c(T, F, F)]
    # [1]  1  4  7 10 13
    df$var1[c(F, T, F)]
    # [1]  2  5  8 11 14
    df$var1[c(T, T, F)]
    # [1]  1  2  4  5  7  8 10 11 13 14
    
    ## Remember that your grepl result is length 5
    !grepl(exp, df$var2[!is.na(df$var2)], fixed=T)
    # [1] FALSE FALSE FALSE FALSE  TRUE
    
    ## So it will be recycled 3 times up to length 15,
    ## and hopefully now this result makes sense!
    df$var1[!grepl(exp, df$var2[!is.na(df$var2)], fixed=T)]
    # [1]  5 10 15
    

    I don't think there's a great way to get the right answer by subsetting only var2, because the resulting logical index no longer lines up with var1. Your first approach with & is correct and proper.
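
If you do want to filter out the missing values first, one possible workaround is to drop the NA rows from the whole data frame, so that var1 and var2 stay the same length and no recycling can occur. The following is only a sketch using the data from the question; the names `pattern`, `keep`, `non_missing` and `result` are made up for illustration (`pattern` stands in for the question's `exp`, which also happens to be the name of a base R function).

```r
## Same data as in the question
df <- data.frame(
  var1 = 1:150,
  var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
)
pattern <- "BC"  # illustrative name; the question calls this `exp`

## Option 1 (the question's first approach): a single logical mask
## that has the same length as df$var1, so nothing is recycled
keep <- !is.na(df$var2) & !grepl(pattern, df$var2, fixed = TRUE)
df$var1[keep]
# [1] 146 147 148 149 150

## Option 2: drop the NA rows from the data frame first, then index
## the filtered var1 with a mask built from the filtered var2
non_missing <- df[!is.na(df$var2), ]
result <- non_missing$var1[!grepl(pattern, non_missing$var2, fixed = TRUE)]
result
# [1] 146 147 148 149 150
```

Both options give the same answer here. The point is that the index and the vector being indexed always come from the same set of rows.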