rleft-joinsynonym

Recognizing synonyms in left_join in R


I have several quite large data tables containing characters, which I would like to join with the entries in my database. The spelling is often not quite right, thus joining is not possible. I know there is no way around creating a synonym table to replace some misspelled characters. But is there a way to automatically detect certain anomalies (see example below)?

My data tables look similar to this:

data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))

The characters in my database are similar to this:

characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))

Currently if i perform a left_join only "Apple" will join:

data <- data %>%
  left_join(characters.database, by = c('products'))

Result:

products ID
potatoe Chips NA
potato Chips NA
potato chips NA
Potato-chips NA
apple NA
Apple 3
Appl NA
Apple Gala NA

Is it possible to automatically ignore: "Case letters", space" ", "-", and an "e" at the end of a word during left_join?

This would be the table i would like:

products ID
potatoe Chips 1
potatoChips 1
potato chips 1
Potato-chips 1
apple 1
Apple 3
Appl 1
Apple Gala NA

Any Ideas?


Solution

  • Thanks Matt Kaye for your suggestion I did something similar now. As I need the correct spelling in the data base and some of my characters contain symbols and numbers which are relevant I did the following:

    #data
    data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
    characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))
    
    #remove spaces and capital letters in data
    data <- data %>%
      mutate(products= tolower(products)) %>%
      mutate(products= gsub(" ", "", products))
    
    #add ID to database
    characters.database <- characters.database %>%
      dplyr::mutate(ID = row_number())
    
    #remove spaces and capital letters in databasr product names
    characters.database_syn <- characters.database %>%
      mutate(products= tolower(products)) %>%
      mutate(products= gsub(" ", "", products))
    
    #join and add correct spelling from database
    data <- data %>%
      left_join(characters.database_syn, by = c('products')) %>%
      select(product_syn=products, 'ID') %>%
      left_join(characters.database, by = c('ID'))
    
    #other synonyms have to manually be corrected or with the help of a synonym table (As in MY data special caracters are relevant!)