I am trying to use R to filter a data frame on two columns, selecting unique rows regardless of which of the two columns a value occurs in.
The input looks like this; the duplicate rows (bolded in the original) are `1 | B | A` and `2 | C | B`. Note that the ID column does NOT contribute to uniqueness:
ID | Col_1 | Col_2 |
---|---|---|
1 | A | A |
1 | A | B |
1 | A | C |
1 | B | A |
1 | B | B |
1 | B | C |
2 | C | B |
2 | C | C |
2 | C | D |
I tried filtering out rows whose Col_2 value doesn't appear in Col_1 (or vice versa), but that removes far too much. What I'd like the output to be is something like:
ID | Col_1 | Col_2 |
---|---|---|
1 | A | A |
1 | A | B |
1 | A | C |
1 | B | B |
1 | B | C |
2 | C | C |
2 | C | D |
I can't swap values between `Col_1` and `Col_2`, because in the cases where there is no "duplication" it is relevant which column a given value appears in. It also doesn't matter whether, e.g., `A B` is selected over `B A`, just that only one of the pair is retained.
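For anyone who wants to reproduce this, the sample input above can be built as follows (column types are assumed to be character, which is what the answers below rely on):

```r
# Sample data from the question; Col_1 and Col_2 are character vectors
df <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Col_1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Col_2 = c("A", "B", "C", "A", "B", "C", "B", "C", "D")
)
```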
One way to solve your problem:

```r
library(dplyr)

df |>
  distinct(C1 = pmin(Col_1, Col_2), C2 = pmax(Col_1, Col_2), .keep_all = TRUE) |>
  select(-c(C1, C2))
```
```
  ID Col_1 Col_2
1  1     A     A
2  1     A     B
3  1     A     C
4  1     B     B
5  1     B     C
6  2     C     C
7  2     C     D
```
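The trick is that `pmin()`/`pmax()` also work element-wise on character vectors, so each row's pair is put into a canonical (sorted) order before `distinct()` compares rows. A quick illustration:

```r
# ("A", "B") and ("B", "A") both normalize to the ordered pair ("A", "B")
pmin(c("A", "B"), c("B", "A"))  # "A" "A"
pmax(c("A", "B"), c("B", "A"))  # "B" "B"
```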
Or using only built-in functions:

```r
with(df, df[!duplicated(cbind(pmin(Col_1, Col_2), pmax(Col_1, Col_2))), ])
```
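This relies on `duplicated()` treating a matrix row-wise, so it flags every repeat of a canonical (pmin, pmax) pair. A minimal sketch of that behavior:

```r
# Two rows that are the same unordered pair: ("A", "B") and ("B", "A")
x <- c("A", "B")
y <- c("B", "A")

# After canonicalizing with pmin/pmax, the second row duplicates the first
duplicated(cbind(pmin(x, y), pmax(x, y)))  # FALSE TRUE
```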