rdplyrstringrfuzzyjoin

Joining two datasets by (non-uniform) names


I need to join two datasets and the only identifier in both are the company names. For example:

db1 <- tibble(
  Company = c('Bombardier Inc.','Honeywell Development Corp','The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)','PepsiCo Canada ULC'),
  var1 = 1:4
)

db2 <- tibble(
  Name = c('Bombardier Inc.','Honeywell Dev Corp','The Pepsi Bottling Group (Canada), ULC','PepsiCo Canada ULC (“Pcu”)'),
  var2 = 6:9
)

Obviously a straightforward dplyr::left_join() is not going to work. I tried the following which didn't work:

fuzzyjoin::regex_left_join(db1,db2,by=c('Company'='Name'))
# A tibble: 4 x 4
  Company                                          var1 Name             var2
  <chr>                                           <int> <chr>           <int>
1 Bombardier Inc.                                     1 Bombardier Inc.     6
2 Honeywell Development Corp                          2 NA                 NA
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 NA                 NA
4 PepsiCo Canada ULC                                  4 NA                 NA

I made some progress by removing "nonessential" characters from the names:

db1 <- db1 %>% mutate(Company.alt = str_remove_all(Company,regex(
  'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
  ignore_case = T
)) %>% str_squish())

db2 <- db2 %>% mutate(Name.alt = str_remove_all(Name,regex(
  'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
  ignore_case = T
)) %>% str_squish())

fuzzyjoin::regex_left_join(db1,db2,by=c('Company.alt'='Name.alt'))
# A tibble: 4 x 6
  Company                                          var1 Company.alt           Name            var2 Name.alt 
  <chr>                                           <int> <chr>                 <chr>          <int> <chr>    
1 Bombardier Inc.                                     1 Bombardier            Bombardier In~     6 Bombardi~
2 Honeywell Development Corp                          2 Honeywell Development Honeywell Dev~     7 Honeywel~
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 Pepsi Bottling Pbgc   The Pepsi Bot~     8 Pepsi Bo~
4 PepsiCo Canada ULC                                  4 PepsiCo               NA                NA NA      

But this still left the last row unmatched. For greater clarity, the last row of Company.alt is PepsiCo which is not considered a fuzzy match with Name.alt's last row of PepsiCo Pcu.

Is there way to successfully left-join the two datasets?


Solution

  • 1) phonics The phonics package has numerous methods for approximate matching such as soundex. See the package documentation for other methods.

    library(dplyr)
    library(phonics)
    
    db1s <- mutate(db1, s = soundex(Company, clean = FALSE))
    db2s <- mutate(db2, s = soundex(Name, clean = FALSE))
    left_join(db1s, db2s)
    

    giving:

    Joining, by = "s"
    # A tibble: 4 x 5
      Company                                          var1 s     Name          var2
      <chr>                                           <int> <chr> <chr>        <int>
    1 Bombardier Inc.                                     1 B516  Bombardier ~     6
    2 Honeywell Development Corp                          2 H543  Honeywell D~     7
    3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 T112  The Pepsi B~     8
    4 PepsiCo Canada ULC                                  4 P122  PepsiCo Can~     9
    

    2) SQLite SQLite has a built-in soundex function.

    library(sqldf)
    
    sqldf("select *
      from db1
      left join db2 on soundex(Company) = soundex(Name)")
    

    giving:

                                              Company var1                                   Name var2
    1                                 Bombardier Inc.    1                        Bombardier Inc.    6
    2                      Honeywell Development Corp    2                     Honeywell Dev Corp    7
    3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)    3 The Pepsi Bottling Group (Canada), ULC    8
    4                              PepsiCo Canada ULC    4             PepsiCo Canada ULC (“Pcu”)    9