Let's assume the following dataframe:
library(dplyr)
dat <- tibble(
Structure = c("A", "B", "X", "A-X", "B-X", "C-X", "A-Y"),
FirstComponent = c(NA, NA, NA, "A", "B", "C", "A"),
SecondComponent = c(NA, NA, NA, "X", "X", "X", "Y"),
IsValid = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE))
dat
# A tibble: 7 × 4
Structure FirstComponent SecondComponent IsValid
<chr> <chr> <chr> <lgl>
1 A NA NA FALSE
2 B NA NA FALSE
3 X NA NA FALSE
4 A-X A X TRUE
5 B-X B X TRUE
6 C-X C X FALSE
7 A-Y A Y FALSE
The goal is to construct a dplyr
pipe that reduces this data frame using the columns Structure
, FirstComponent
and SecondComponent
so that only the rows remain for which IsValid == TRUE
. Only those rows should be retained for which the content of FirstComponent
and SecondComponent
can be found in the column Structure
any where in the data frame:
FirstComponent
and SecondComponent
are NA
FirstComponent
and SecondComponent
can be found as values in Structure
.C
in FirstComponent
doesn't occur anywhere as a value of Structure
.Y
in SecondComponent
doesn't occur anywhere else as a value of Structure
.(Inspired by the answers to "How to join a data frame to itself within a dplyr chain?") I perform self-joins using Structure
and the information from FirstComponent
so that rows with values in FirstComponent
don't have matching values in Structure
are eliminated, and then feed the result into a second self-join that does the same for SecondComponent
.
Working, but impractical code
The goal is to achieve the filtering within a single dplyr
pipe. I've found a solution that violates this constraint as it assumes that dat
exists as a persistent object that can be explicitly provided as the first argument to the dplyr
methods. In my case, this is not a given. So while the following code does produce the desired result, it's not a solution to the problem:
inner_join(
x=inner_join(
x=dat,
y=select(dat, FirstComponent=Structure)),
y=select(dat, SecondComponent=Structure))
Joining, by = "FirstComponent"
Joining, by = "SecondComponent"
# A tibble: 2 × 4
Structure FirstComponent SecondComponent IsValid
<chr> <chr> <chr> <lgl>
1 A-X A X TRUE
2 B-X B X TRUE
Failing solution 1
Simply adding the pipe and replacing all references to .
produces an error, presumably because the .
placeholder is out of scope for the nested inner_join()
:
dat %>%
inner_join(
x=inner_join(
x=.,
y=select(., FirstComponent=Structure)),
y=select(., SecondComponent=Structure))
Joining, by = "FirstComponent"
Error: Can't subset columns that don't exist.
✖ Columns `x` and `y` don't exist.
Run `rlang::last_error()` to see where the error occurred.
Failing solution 2
I tried to eliminate the nested inner_join()
by performing the two self-joins in a strictly consecutive way along the pipe. But this doesn't work either because by the time the second inner-join is reached, the content value of .
is already reduced to exclude rows 1–3, and hence the eventual result of the second self-join that matches SecondComponent
to the remaining values of Structure
yields an empty data frame:
dat %>%
inner_join(
x=.,
y=select(., FirstComponent=Structure)) %>%
inner_join(
x=.,
y=select(., SecondComponent=Structure))
# A tibble: 0 × 4
# … with 4 variables: Structure <chr>, FirstComponent <chr>, SecondComponent <chr>,
# IsValid <lgl>
How do I proceed?
I've considered using a T-pipe as provided by magrittr
s %T>%
operator to get around the issue with solution 2, but with no success.
I guess you are looking for a filter
that looks within the columns of the same dataframe (assuming Structure
won't have any NA
values):
library(dplyr)
dat %>%
filter(FirstComponent %in% Structure, SecondComponent %in% Structure)
#> # A tibble: 2 × 4
#> Structure FirstComponent SecondComponent IsValid
#> <chr> <chr> <chr> <lgl>
#> 1 A-X A X TRUE
#> 2 B-X B X TRUE