I have two vectors: Candidates$names
containing roughly 45.000 names of electoral candidates and Incumbents$names
containing roughly 7600 names of members of parliament. I want to check for each of the names in Candidates
whether it exists in Incumbents
and create a new dummy variable incumbent
in Candidates
that takes the value 1 if that is the case, 0 if not.
My problem is that the names between the two lists can vary ever so slightly. Sometimes there are titles included in the name, sometimes middle names etc. So direct matching would not work reliably, but I need an approach that allows for some fuzziness.
I tried expand.grid(Candidates$names, Incumbents$names)
in combination with adist()
as an indicator of proximity and then setting arbitrary percentages (based on distance/length of name) as a cut-off point, but the lengths of the resulting table crashes R on my computer and the method does not appear practical or sufficiently reliable.
Is there a better way to perform the required fuzzy matching?
EDIT: Here some example vectors.
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter"))
The resulting data frame should look something like this:
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = c(1, 0, 1, 1, 0, 0)
EDIT #2: The response by phiver was very helpful, yet I have run into the problem that some names are in my dataset more than once. To identify them uniquely, I would like to use an additional variable Candidates$party
and Incumbents$party
in the matching process. How would I include this precise additional matching variable in the code?
To modify my example:
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican") , incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter", "John Smith"), party = ("Republican", "Democrat", "Democrat", "Democrat", "Republican")
The output should be the same as before with the addition of the Republican John Smith as an incumbent.
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican") , incumbent = c(1, 0, 1, 1, 0, 0, 0, 1)
using the fuzzyjoin package you can match the names in a join.
With the example given this code will reproduce the expected example output.
library(fuzzyjoin)
library(dplyr)
Candidates %>%
stringdist_left_join(Incumbents, by = c("name" = "name"), method = "jw", max_dist = 0.2) %>%
rename(candidate_name = name.x) %>%
mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>%
select(-name.y)
candidate_name incumbent
1 Barack Obama 1
2 George W. Bush 0
3 Jimmy Carter 1
4 Tony Blair 1
5 Mickey Mouse 0
6 Darth Vader 0
Now the reason for jw is that this has been developed for matching names where there are only a few mistakes. The closer to 0 the more correct the name, with 0 being the exactly the same name. Choosing the correct max_dist is a bit of finetuning. Cleaning up some names might help.
To get get rid of titles like "Sir" etc. you could use regex on the incubent names directly with gsub
or use stringr::str_remove
before using the join.
You can extend the join to join on multiple columns.
Candidates %>%
stringdist_left_join(Incumbents, by = c("name" = "name", "party" = "party"),
method = "jw",
max_dist = 0.2) %>%
rename(candidate_name = name.x,
candidate_party = party.x) %>%
mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>%
select(-ends_with(".y")) # remove not needed columns coming from incubent table
candidate_name candidate_party incumbent
1 Barack Obama Democrat 1
2 George W. Bush Republican 0
3 Jimmy Carter Democrat 1
4 Tony Blair Democrat 1
5 Mickey Mouse Republican 0
6 Darth Vader Democrat 0
7 John Smith Democrat 0
8 John Smith Republican 1