rdataframedplyrself-join

Executing two self-joins after another to filter on two columns in a data frame


Let's assume the following dataframe:

library(dplyr)

dat <- tibble(
    Structure = c("A", "B", "X", "A-X", "B-X", "C-X", "A-Y"), 
    FirstComponent = c(NA, NA, NA, "A", "B", "C", "A"), 
    SecondComponent = c(NA, NA, NA, "X", "X", "X", "Y"),
    IsValid = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE))

dat
# A tibble: 7 × 4
  Structure FirstComponent SecondComponent IsValid
  <chr>     <chr>          <chr>           <lgl>  
1 A         NA             NA              FALSE  
2 B         NA             NA              FALSE  
3 X         NA             NA              FALSE  
4 A-X       A              X               TRUE   
5 B-X       B              X               TRUE   
6 C-X       C              X               FALSE  
7 A-Y       A              Y               FALSE 

The goal is to construct a dplyr pipe that reduces this data frame using the columns Structure, FirstComponent and SecondComponent so that only the rows remain for which IsValid == TRUE. Only those rows should be retained for which the content of FirstComponent and SecondComponent can be found in the column Structure any where in the data frame:

(Inspired by the answers to "How to join a data frame to itself within a dplyr chain?") I perform self-joins using Structure and the information from FirstComponent so that rows with values in FirstComponent don't have matching values in Structure are eliminated, and then feed the result into a second self-join that does the same for SecondComponent.

Working, but impractical code

The goal is to achieve the filtering within a single dplyr pipe. I've found a solution that violates this constraint as it assumes that dat exists as a persistent object that can be explicitly provided as the first argument to the dplyr methods. In my case, this is not a given. So while the following code does produce the desired result, it's not a solution to the problem:

inner_join(
    x=inner_join(
        x=dat,
        y=select(dat, FirstComponent=Structure)),
    y=select(dat, SecondComponent=Structure))

Joining, by = "FirstComponent"
Joining, by = "SecondComponent"
# A tibble: 2 × 4
  Structure FirstComponent SecondComponent IsValid
  <chr>     <chr>          <chr>           <lgl>  
1 A-X       A              X               TRUE   
2 B-X       B              X               TRUE   

Failing solution 1

Simply adding the pipe and replacing all references to . produces an error, presumably because the . placeholder is out of scope for the nested inner_join():

dat %>%
    inner_join(
        x=inner_join(
            x=.,
            y=select(., FirstComponent=Structure)),
        y=select(., SecondComponent=Structure))
Joining, by = "FirstComponent"
Error: Can't subset columns that don't exist.
✖ Columns `x` and `y` don't exist.
Run `rlang::last_error()` to see where the error occurred.

Failing solution 2

I tried to eliminate the nested inner_join() by performing the two self-joins in a strictly consecutive way along the pipe. But this doesn't work either because by the time the second inner-join is reached, the content value of . is already reduced to exclude rows 1–3, and hence the eventual result of the second self-join that matches SecondComponent to the remaining values of Structure yields an empty data frame:

dat %>%
    inner_join(
        x=.,
        y=select(., FirstComponent=Structure)) %>%
    inner_join(
        x=., 
        y=select(., SecondComponent=Structure))

# A tibble: 0 × 4
# … with 4 variables: Structure <chr>, FirstComponent <chr>, SecondComponent <chr>,
#   IsValid <lgl>

How do I proceed?

I've considered using a T-pipe as provided by magrittrs %T>% operator to get around the issue with solution 2, but with no success.


Solution

  • I guess you are looking for a filter that looks within the columns of the same dataframe (assuming Structure won't have any NA values):

    library(dplyr)
    
    dat %>% 
      filter(FirstComponent %in% Structure, SecondComponent %in% Structure)
    
    #> # A tibble: 2 × 4
    #>   Structure FirstComponent SecondComponent IsValid
    #>   <chr>     <chr>          <chr>           <lgl>  
    #> 1 A-X       A              X               TRUE   
    #> 2 B-X       B              X               TRUE