rdplyrnasemi-join

Join data but ignore missing values


I am having some trouble with joining data frames with dplyr, where I would like to ignore the NAs.

The data that I have is quite big, but a simplified version looks like:

id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)

df <- data.frame(A, B, C)

     id    A    B    C
1    id1   E    T    NA
2    id2   F    NA   T
3    id3   G    N    U
4    id4   NA   T    NA

I have an entry that I would like to match with df, which is e.g.:

df2 <- data.frame(A = "E", B = "T", C = "M")

    A    B    C
1   E    T    M

As a result I would like to obtain all rows from df that match with df2, but the NAs should be ignored. So the result should look like this:

     id    A    B    C
1    id1   E    T    NA
2    id4   NA   T    NA

I was trying to do this with semi_join, but it did not work so far:

result <- df %>%
  group_by(n = seq(n())) %>%
  do(modify_if(., is.na, ~NULL) %>%
       semi_join(df2, by = c("A", "B", "C"))) %>%
  ungroup %>%
  select(-n)

Which results in:

Error: `by` can't contain join column `C` which is missing from LHS
Call `rlang::last_error()` to see a backtrace

Who knows the answer?


Solution

  • Here's a solution with a mix of tidyverse and base R. I think this is pretty clear, but I'd be interested in a pure tidyverse implementation that isn't completely contrived.

    The idea is to first expand all entries in df and df2 and then filter through all the columns using a loop.

    The data:

    id <- c("id1", "id2", "id3", "id4")
    A <- c("E", "F", "G", NA)
    B <- c("T", NA, "N", "T")
    C <- c(NA, "T", "U", NA)
    
    df <- data.frame(id, A, B, C, stringsAsFactors = F) # Make sure to use strings not factors
    df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = F)
    

    Code:

    library(tidyr)
    results <- crossing(df, df2)
    select_columns <- c("A", "B", "C")
    for(col in select_columns) {
      keep <- is.na(results[[col]]) | results[[col]] == results[[paste0(col, 1)]]
      results <- results[keep,, drop=F]
    }
    results <- results %>% dplyr::select(id, A:C) %>% distinct
    results
    
       id    A B    C
    1 id1    E T <NA>
    2 id4 <NA> T <NA>