I'm trying to join two dataframes in R, each of which has multiple repetitions of the same ID number. This is to be expected within the data and is not a data quality issue.
It was my understanding that I should do a left join between the two dataframes, specify the many to many relationship, and this would work. However, my IDs get duplicated in a way I don't expect.
Here is some test data:
test4 <- data.frame(ID = c('A', 'A', 'B', 'B', 'C'), Score = c(1,2,3,4,5))
test5 <- data.frame(ID = c('A', 'A', 'B', 'B', 'C'), Race = c('W','W','B','B','W'))
test6 <- test4 %>% left_join(test5, by = "ID", relationship = "many-to-many")
This outputs the data like so:
ID Score Race
A 1 W
A 1 W
A 2 W
A 2 W
B 3 B
B 3 B
B 4 B
B 4 B
C 5 W
When I want it to output like so:
ID Score Race
A 1 W
A 2 W
B 3 B
B 4 B
C 5 W
I'm willing to bet that I'm missing something simple, but I just can't figure out what. Any advice much appreciated!
you can use distinct
before applying left_join
, e.g.,
test4 %>%
left_join(distinct(test5), by = "ID")
or specify multiple = "first"
in left_join
test4 %>%
left_join(test5, by = "ID", multiple = "first")
which gives
ID Score Race
1 A 1 W
2 A 2 W
3 B 3 B
4 B 4 B
5 C 5 W