
R: Consolidate different spellings of the same entry into one


I have a data set that is sorted by company names. Sometimes the names are misspelled and show as unique entries:

Name
ABC Company
ABc Company
DEF Company
def compANY
Ddf Cmpany
abC comPany

In fact, these entries are variations of just two company names. This is clearly a problem with my initial data set, but I need to fix it to process my data correctly. The result I want is:

Name
ABC Company
DEF Company

I don't know how I can approach this, other than long loops that test modified versions of the words against a dictionary-like data structure. Is there a library for spellchecking (and would that even make sense for company names)?

I'd appreciate any help and don't have a preference for any package. Thank you.


Solution

  • You can use adist to compute approximate string distances (generalized Levenshtein edit distances) between the names, pass them to hclust to build a hierarchical clustering, and then cut the tree into groups with cutree.

    # Cluster the names by case-insensitive edit distance
    hc <- hclust(as.dist(adist(Name, ignore.case = TRUE)))
    # Keep the first name in each of the two clusters
    Name[!duplicated(cutree(hc, k = 2))]
    #[1] "ABC Company" "DEF Company"
    

    Data:

    Name <- c("ABC Company","ABc Company","DEF Company","def compANY","Ddf Cmpany","abC comPany")
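
    If the number of distinct companies is not known in advance, one possible variation is to cut the tree at a distance threshold instead of a fixed k, and then map every entry to the first spelling seen in its group. The sketch below is only an illustration: the threshold max_dist = 2 and the variable names grp and canonical are assumptions, not part of the answer above, and the threshold would need tuning on real data.

```r
Name <- c("ABC Company", "ABc Company", "DEF Company",
          "def compANY", "Ddf Cmpany", "abC comPany")

# Pairwise edit distances, ignoring case
d <- adist(Name, ignore.case = TRUE)

# Complete-linkage clustering (the hclust default)
hc <- hclust(as.dist(d))

# Assumed threshold: names in one group differ by at most 2 edits
max_dist <- 2
grp <- cutree(hc, h = max_dist)

# Map each entry to the first spelling that appears in its group
canonical <- Name[match(grp, grp)]

data.frame(Name, group = grp, canonical)
unique(canonical)
#[1] "ABC Company" "DEF Company"
```

    Choosing the first spelling as the canonical one is arbitrary; picking the most frequent spelling in each group would be an equally reasonable choice.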