rstringdataframematching

Method for finding reversed names in variable


I am working with large administrative data and have found that there are correctly-ordered names and names that are ordered in the reverse (definitely expected, but I haven't had to parse such errors before). I need to figure out which names have been reversed and correct the order of those names based on other observations of the correctly-ordered names.

Example:

tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))

I have a dataset of first names and last names which may or may not cover the observed names.

Example:

tibble(forename = c("MIMO", "KEY", "DAVID"),
       surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))

I have tried probabilistic matching to no avail because the data produce too many results for the packages to handle. I would like to steer clear of fuzzy matching.


Solution

  • Here's a solution for this toy data example using sapply and strsplit.

    names <- tibble(forename = c("MIMO", "KEY", "DAVID"), surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))
    
    names.list <- lapply(names, strsplit, split=" ")
    
    df['new_names'] <- sapply(strsplit(df$names, " "), \(x) {
                         if(any((x[1]==names.list$forename & x[2]==names.list$surname))) {
                           paste(x[1], x[2]) # do nothing
                           } else 
                             if(any((x[1]==names.list$surname & x[2]==names.list$forename))) 
                               paste(x[2], x[1]) # reverse the names
                             else paste(x[1], x[2]) # do nothing
                         } )
    
    df
    # A tibble: 6 × 2
      names                 new_names            
      <chr>                 <chr>                
    1 MIMO DAGRYL           MIMO DAGRYL          
    2 LUMPERSTEIN ANASTASIA LUMPERSTEIN ANASTASIA
    3 KEY MORPHIN           KEY MORPHIN          
    4 DAGRYL MIMO           MIMO DAGRYL          
    5 ANASTASIA LUMPERSTEIN ANASTASIA LUMPERSTEIN
    6 KEY MORPHIN           KEY MORPHIN
    

    names
    # A tibble: 3 × 2
      forename surname    
      <chr>    <chr>      
    1 MIMO     DAGRYL     
    2 KEY      LUMPERSTEIN
    3 DAVID    MORPHIN