I am working with large administrative data and have found that there are correctly-ordered names and names that are ordered in the reverse (definitely expected, but I haven't had to parse such errors before). I need to figure out which names have been reversed and correct the order of those names based on other observations of the correctly-ordered names.
Example:
tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))
I have a dataset of first names and last names which may or may not cover the observed names.
Example:
tibble(forename = c("MIMO", "KEY", "DAVID"),
surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))
I have tried probabilistic matching to no avail because the data produce too many results for the packages to handle. I would like to steer clear of fuzzy matching.
Here's a solution for this toy data example using sapply
and strsplit
.
names <- tibble(forename = c("MIMO", "KEY", "DAVID"), surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))
names.list <- lapply(names, strsplit, split=" ")
df['new_names'] <- sapply(strsplit(df$names, " "), \(x) {
if(any((x[1]==names.list$forename & x[2]==names.list$surname))) {
paste(x[1], x[2]) # do nothing
} else
if(any((x[1]==names.list$surname & x[2]==names.list$forename)))
paste(x[2], x[1]) # reverse the names
else paste(x[1], x[2]) # do nothing
} )
df
# A tibble: 6 × 2
names new_names
<chr> <chr>
1 MIMO DAGRYL MIMO DAGRYL
2 LUMPERSTEIN ANASTASIA LUMPERSTEIN ANASTASIA
3 KEY MORPHIN KEY MORPHIN
4 DAGRYL MIMO MIMO DAGRYL
5 ANASTASIA LUMPERSTEIN ANASTASIA LUMPERSTEIN
6 KEY MORPHIN KEY MORPHIN
names
# A tibble: 3 × 2
forename surname
<chr> <chr>
1 MIMO DAGRYL
2 KEY LUMPERSTEIN
3 DAVID MORPHIN