I have a data set that is sorted by company name. Sometimes the names are misspelled or capitalized inconsistently, so they show up as unique entries:
Name
ABC Company
ABc Company
DEF Company
def compANY
Ddf Cmpany
abC comPany
In fact, these entries are variations of the same two company names. This is clearly a problem with my initial data set, but I need to fix it to process my data correctly. The result I want is:
Name
ABC Company
DEF Company
I don't know how I can approach this, other than long loops that test modified versions of the words against a dictionary-like data structure. Is there a library for spellchecking (and would that even make sense for company names)?
I'd appreciate any help and don't have a preference for any package. Thank you.
You can use adist to get the approximate string distances, pass them to hclust to build clusters, and then split the clusters into groups with cutree.
# Cluster the names on their case-insensitive edit distance
hc <- hclust(as.dist(adist(Name, ignore.case = TRUE)))
# Keep the first name from each of the two groups
Name[!duplicated(cutree(hc, k = 2))]
#[1] "ABC Company" "DEF Company"
Data:
Name <- c("ABC Company","ABc Company","DEF Company","def compANY","Ddf Cmpany","abC comPany")
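If the number of distinct companies is not known in advance, one option is to cut the tree at a distance threshold instead of a fixed k. The following is only a sketch: the threshold max_dist and the choice of the first spelling in each group as the canonical name are assumptions for illustration, not part of the approach above, and the threshold would need tuning on real data.

```r
Name <- c("ABC Company", "ABc Company", "DEF Company",
          "def compANY", "Ddf Cmpany", "abC comPany")

# Pairwise edit distances, ignoring case, then hierarchical clustering
hc <- hclust(as.dist(adist(Name, ignore.case = TRUE)))

# Assumption: merges at an edit distance of 2 or less are the same company
max_dist <- 2
grp <- cutree(hc, h = max_dist)

# Map every entry to the first spelling that appears in its group
canonical <- Name[match(grp, grp)]

data.frame(Name, group = grp, canonical)
```

With the sample data this should assign "Ddf Cmpany" to the same group as "DEF Company". A threshold that is too large will merge genuinely different companies, so it is worth inspecting the groups before overwriting the original column.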