Background: I am using R to help me figure out which entries in file A (expected names) map to which entries in file B (possible names). Unfortunately, file B has multiple entries with the same ID and different names. Sometimes these names are exact matches of the ID in file A, but other times they are close or reworded. For now, I am finding those direct 1:1 matches, and removing the non-matches with the same ID.
Question: Is there a cleaner way to remove known non-matches inside of groups which contain 1:1 matches, while not removing non-matches in other groups?
Example data:
example <- tribble(
~id, ~expect, ~possible, ~status,
1, "box of forks", "spoon drawer", "review",
1, "box of forks", "box of forks", "match",
1, "box of forks", "cheese knife", "review",
2, "dish washer", "dish washing machine", "review",
2, "dish washer", "oven", "review",
2, "dish washer", "microwave", "review",
)
In this example, row 2 is a 1:1 match, so it was given the status "match". I, definitely a human, know that row 4 is also a match, because I can read and understand the data set. Since I expect there will be cases like this, I can't simply do filter(status == "match")
. However, before I get into reading the file, I want to remove the rows that I now know are going to be junk. As such, I have figured out that I can do:
example %>%
group_by(id) %>%
mutate(matches = case_when(
status == "review" ~ 0,
status == "match" ~ 1,
),
total = sum(matches)
) %>%
filter(
!(matches == 0 & total > 0)
)
Which gives the correct/expected result:
id expect possible status matches total
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 box of forks box of forks match 1 1
2 2 dish washer dish washing machine review 0 0
3 2 dish washer oven review 0 0
4 2 dish washer microwave review 0 0
While this works, it looks clunky and makes me sad. Is there a cleaner way of removing entries that are still "review" and exist inside of groups which contain at least one row with "match"?
Maybe just use all
.
example %>%
filter(all(status=="review") | status=="match", .by=id)
# A tibble: 4 × 4
id expect possible status
<dbl> <chr> <chr> <chr>
1 1 box of forks box of forks match
2 2 dish washer dish washing machine review
3 2 dish washer oven review
4 2 dish washer microwave review