rdataframedata-analysistibble

Inner_join in R with duplicates?


I have two tibbles with a list of names and their scores on one of two tests. I want to merge these dataframes so each row consists of a name, and their scores on test one and test two.

df1<-tibble(person=c("Joe", "John", "Albert", "Joe", "John"), test_1=c(90, 91, 92,93, 94 ))
df2<-tibble(person=c("Joe", "John", "Albert", "Joe"), test_2=c(94, 70, 93,50))

df3<-inner_join(df1, df2, by="person")

I would like to get this:

df4<-tibble(person=c("Joe", "Albert", "Joe"), test_1=c(90, 92, 93), test_2=c(94, 93, 94)

However, when I run the first code block, I get four different rows for person "Joe" even though there are only two observations of "Joe" in each dataframe. I only want to "match" two persons with the same name (like "Joe") if their test scores are within five points of each other.

Is there a way to do this? I would appreciate any help!

Note: I accidentally rollbacked this post when trying to edit, which removed someone's answer so I am reposting it. Very sorry to the person whose answer was removed!


Solution

  • Given your expected output then my original post would work. Just filter after doing a many-to-many join:

    library(dplyr)
    
    inner_join(df1, df2, by = "person", relationship = 'many-to-many') |>
      filter(abs(test_1 - test_2) <= 5 | n() == 1, .by = person)
    

    the .by is doing the filter by person. So then n() represents the number of rows within each by group. This additional condition is ensuring that rows with a one-to-one match will be returned no matter how far apart test_1 and test_2 are. For example, if Albert had a test_1 score of 92 and a test_2 score of 0 that row would not be filtered out because there is only one Albert.