r string text-analysis

Automatically extracting strings with mismatched spellings from a column and replacing them in R


I have a huge dataset which is similar to the columns posted below:

NameofEmployee <- c("x", "y", "z", "a")
Region <- c("Pune", "Orissa", "Orisa", "Poone")

As you can see, in the Region column, the region "Pune" is spelled in two different ways, i.e. "Pune" and "Poone".

Similarly, "Orissa" is spelled as "Orissa" and "Orisa".

I have multiple regions which are actually the same but are spelled in different ways. This will cause problems when I analyze the data.

I want to automatically be able to obtain a list of these mismatched spellings with the help of R.
I would also like to replace the spellings with the correct spellings automatically.


Solution

  • Misspellings are hard to detect, even more so when working with names.

    I'd suggest using a string distance to measure how close two words are. You can easily do this with tidystringdist, which builds all the pairwise combinations from a vector and then computes every string distance method available in stringdist:

    Region <- c("Pune", "Orissa", "Orisa", "Poone")
    
    library(tidystringdist)
    library(magrittr)
    
    tidy_comb_all(Region) %>%
      tidy_stringdist()
    #> # A tibble: 6 x 12
    #>   V1     V2      osa    lv    dl hamming   lcs qgram cosine jaccard     jw
    #> * <chr>  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
    #> 1 Pune   Oris…     6     6     6     Inf    10    10 1          1   1     
    #> 2 Pune   Orisa     5     5     5     Inf     9     9 1          1   1     
    #> 3 Pune   Poone     2     2     2     Inf     3     3 0.433      0.4 0.217 
    #> 4 Orissa Orisa     1     1     1     Inf     1     1 0.0513     0   0.0556
    #> 5 Orissa Poone     6     6     6     Inf    11    11 1          1   1     
    #> 6 Orisa  Poone     5     5     5       5    10    10 1          1   1     
    #> # ... with 1 more variable: soundex <dbl>
    

    Created on 2018-07-24 by the reprex package (v0.2.0).

    As you can see here, Pune and Poone have an osa, lv and dl distance of 2, and Orisa / Orissa a distance of 1, suggesting their spelling is very close.
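
    To turn that observation into a list of candidate misspellings, one option is to keep only the pairs whose distance falls under a cutoff. The sketch below uses base R's `adist()` (Levenshtein distance, the `lv` column above) instead of tidystringdist so that it runs without extra packages. The cutoff `max_dist <- 2` is an assumption that happens to fit this toy vector; real data will likely need a different value and a manual review of the flagged pairs.

    ```r
    Region <- c("Pune", "Orissa", "Orisa", "Poone")

    # All unordered pairs of the distinct spellings
    spellings <- unique(Region)
    pairs <- as.data.frame(t(combn(spellings, 2)), stringsAsFactors = FALSE)
    names(pairs) <- c("V1", "V2")

    # Levenshtein distance for each pair (utils::adist ships with base R)
    pairs$lv <- mapply(function(a, b) as.integer(adist(a, b)),
                       pairs$V1, pairs$V2, USE.NAMES = FALSE)

    # Assumed cutoff: pairs at distance <= 2 are flagged as likely variants
    max_dist <- 2
    candidates <- pairs[pairs$lv <= max_dist, ]
    candidates
    ```

    With this input, the flagged pairs are Pune / Poone and Orissa / Orisa. Short names that are genuinely different can also fall under a small cutoff, so the result is a list to check rather than a list to trust.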

    Once you have identified these pairs, you can decide which spelling of each is the correct one and do the replacement.
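
    A minimal sketch of the replacement step, assuming you have settled on a vector of correct spellings (called `canonical` here, a hypothetical name that has to be curated by hand). Each observed value is snapped to its nearest correct spelling, but only when the Levenshtein distance is within an assumed cutoff of 2, so unrelated values are left untouched. Again this uses base R's `adist()` rather than stringdist.

    ```r
    Region <- c("Pune", "Orissa", "Orisa", "Poone")

    # Assumed list of correct spellings; not derived automatically
    canonical <- c("Pune", "Orissa")

    # Distances from every observed value (rows) to every correct spelling (columns)
    d <- adist(Region, canonical)

    # Index of the closest correct spelling for each value, and its distance
    nearest <- apply(d, 1, which.min)
    nearest_dist <- d[cbind(seq_along(Region), nearest)]

    # Replace only when the closest match is within the assumed cutoff
    max_dist <- 2
    Region_clean <- ifelse(nearest_dist <= max_dist, canonical[nearest], Region)
    Region_clean
    #> [1] "Pune"   "Orissa" "Orissa" "Pune"
    ```

    On a large column it is usually cheaper to run this on `unique(Region)` and then map the cleaned values back onto the full column.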