rtidyverseanti-join

How to use anti_join with different levels of two variables?


I have been trying for hours and I can't figure it out. I have a data frame with subjects and conditions df1, from which I want to exclude observations which have a certain value (less than 3 in the variable "value" from df2. I cannot make it work because I need to remove from df1, combinations of different levels of two variables.

This is df1:

df1 <- structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,2L, 2L, 2L, 2L, 
                                  2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), 
                      condition = c("A", "A", "A", "B", "B", "B", "C", "C","C", "A", "A", 
                                    "A", "B", "B", "B", "C", "C", "C", "A", "A", "A","B", "B", "B", "C", "C", "C")), 
                 row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))

And this is df2

df2 <- structure(list(subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,4L, 4L, 4L, 5L, 5L, 5L), 
                      condition = c("A", "B", "C", "A", "B","C", "A", "B", "C", "A", "B", "C", "A", "B", "C"), 
                      value = c(10L, 8L, 7L, 3L, 8L, 5L, 3L, 3L, 9L, 8L, 7L, 8L, 10L, 6L, 2L)), 
                 row.names = c(NA,-15L), class = c("tbl_df", "tbl", "data.frame"))

And I want to remove in df1 all the combinations of subject and condition with a value under 3 so this would be the final df:

df3 <- structure(list(subject = c(2L, 3L, 3L, 5L), 
                      condition = c("A","A", "B", "C")), 
                 row.names = c(NA, -4L), 
                 class = c("tbl_df","tbl", "data.frame"))

So far I have been doing it like this, but I can't anymore because I have hundreds of rows...

df3 <- df1 %>% filter(!(subject==2 & condition=="A" |
                        subject==3 & (condition=="A" | condition=="B") |
                        subject==5 & condition=="C"))

Solution

  • Your sample result for df3 conflicts with the code you use to derive it, so here is a dplyr solution for each interpretation of what you want for df3.

    Note: Both results are only possible when you

    ...exclude observations which have a certain value (less than [or equal to] 3 in the variable "value" from df2.

    so I have implemented these solutions using the inequality <= 3 rather than < 3.

    1st Interpretation of df3

    To obtain the version of df3

    # A tibble: 4 x 2
      subject condition
        <int> <chr>    
    1       2 A        
    2       3 A        
    3       3 B        
    4       5 C        
    

    that you provide here as a sample result

    And I want to remove in df1 all the combinations of subject and condition with a value under 3 so this would be the final df:

    df3 <- structure(list(subject = c(2L, 3L, 3L, 5L), 
                          condition = c("A","A", "B", "C")), 
                     row.names = c(NA, -4L), 
                     class = c("tbl_df","tbl", "data.frame"))
    

    simply use filter() on df2:

    library(dplyr)
    
    
    # ...
    # Code to generate 'df1' and 'df2'.
    # ...
    
    df3 <- df2 %>% filter(value <= 3)
    

    2nd Interpretation of df3

    However, I it appears you actually desire the following version of df3

    # A tibble: 18 x 2
       subject condition
         <int> <chr>    
     1       1 A        
     2       1 A        
     3       1 A        
     4       1 B        
     5       1 B        
     6       1 B        
     7       1 C        
     8       1 C        
     9       1 C        
    10       2 B        
    11       2 B        
    12       2 B        
    13       2 C        
    14       2 C        
    15       2 C        
    16       3 C        
    17       3 C        
    18       3 C        
    

    which you derive here:

    df3 <- df1 %>% filter(!(subject==2 & condition=="A" |
                            subject==3 & (condition=="A" |condition=="B") |
                            subject==5 & condition=="C"))
    

    In that case, you should anti_join() your df1 to a filter()ed version of df2:

    library(dplyr)
    
    
    # ...
    # Code to generate 'df1' and 'df2'.
    # ...
    
    
    df3 <- df1 %>%
      anti_join(df2 %>% filter(value <= 3), by = c("subject", "condition"))