I have several quite large data tables containing characters, which I would like to join with the entries in my database. The spelling is often not quite right, thus joining is not possible. I know there is no way around creating a synonym table to replace some misspelled characters. But is there a way to automatically detect certain anomalies (see example below)?
My data tables look similar to this:
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
The characters in my database are similar to this:
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))
Currently if i perform a left_join only "Apple" will join:
data <- data %>%
left_join(characters.database, by = c('products'))
Result:
products | ID |
---|---|
potatoe Chips | NA |
potato Chips | NA |
potato chips | NA |
Potato-chips | NA |
apple | NA |
Apple | 3 |
Appl | NA |
Apple Gala | NA |
Is it possible to automatically ignore: "Case letters", space" ", "-", and an "e" at the end of a word during left_join?
This would be the table i would like:
products | ID |
---|---|
potatoe Chips | 1 |
potatoChips | 1 |
potato chips | 1 |
Potato-chips | 1 |
apple | 1 |
Apple | 3 |
Appl | 1 |
Apple Gala | NA |
Any Ideas?
Thanks Matt Kaye for your suggestion I did something similar now. As I need the correct spelling in the data base and some of my characters contain symbols and numbers which are relevant I did the following:
#data
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))
#remove spaces and capital letters in data
data <- data %>%
mutate(products= tolower(products)) %>%
mutate(products= gsub(" ", "", products))
#add ID to database
characters.database <- characters.database %>%
dplyr::mutate(ID = row_number())
#remove spaces and capital letters in databasr product names
characters.database_syn <- characters.database %>%
mutate(products= tolower(products)) %>%
mutate(products= gsub(" ", "", products))
#join and add correct spelling from database
data <- data %>%
left_join(characters.database_syn, by = c('products')) %>%
select(product_syn=products, 'ID') %>%
left_join(characters.database, by = c('ID'))
#other synonyms have to manually be corrected or with the help of a synonym table (As in MY data special caracters are relevant!)