I have a huge dataset with columns similar to the ones below:
NameofEmployee <- c("x", "y", "z", "a")
Region <- c("Pune", "Orissa", "Orisa", "Poone")
As you can see, in the Region column, the region "Pune" is spelled in two different ways, i.e. "Pune" and "Poone".
Similarly, "Orissa" is spelled as "Orissa" and "Orisa".
I have multiple regions which are actually the same but are spelled in different ways. This will cause problems when I analyze the data.
I want to automatically be able to obtain a list of these mismatched spellings with the help of R.
I would also like to replace the spellings with the correct spellings automatically.
Misspellings are hard to detect, even more so when working with names.
I suggest using a string distance to measure how close two words are. You can easily do this with tidystringdist, which lets you get all the pairwise combinations from a vector and then compute all the string distance methods available in stringdist:
Region <- c("Pune", "Orissa", "Orisa", "Poone")
library(tidystringdist)
library(magrittr)
tidy_comb_all(Region) %>%
tidy_stringdist()
#> # A tibble: 6 x 12
#> V1 V2 osa lv dl hamming lcs qgram cosine jaccard jw
#> * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Pune Oris… 6 6 6 Inf 10 10 1 1 1
#> 2 Pune Orisa 5 5 5 Inf 9 9 1 1 1
#> 3 Pune Poone 2 2 2 Inf 3 3 0.433 0.4 0.217
#> 4 Orissa Orisa 1 1 1 Inf 1 1 0.0513 0 0.0556
#> 5 Orissa Poone 6 6 6 Inf 11 11 1 1 1
#> 6 Orisa Poone 5 5 5 5 10 10 1 1 1
#> # ... with 1 more variable: soundex <dbl>
Created on 2018-07-24 by the reprex package (v0.2.0).
As you can see here, Pune and Poone have an osa, lv and dl distance of 2, and Orissa and Orisa have a distance of 1, suggesting that each pair is a spelling variant of the same name.
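To get the list of suspicious pairs automatically rather than reading the table by eye, one option is to filter on one of the distance columns. This is a minimal sketch, assuming dplyr is installed and that an OSA distance of at most 2 is a sensible cutoff for this data; the threshold and the name `close_pairs` are my own choices, not something the packages prescribe, and you will probably need to tune the cutoff on real data.

```r
library(tidystringdist)
library(dplyr)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Compute all pairwise distances, then keep only pairs whose OSA distance
# is at most 2 (assumed cutoff; adjust for your data)
close_pairs <- tidy_comb_all(Region) %>%
  tidy_stringdist() %>%
  filter(osa <= 2)

close_pairs
```

With the distances shown above, this should leave only the Pune / Poone and Orissa / Orisa rows.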
Once you have identified these pairs, you can replace each variant with the correct spelling.
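For the replacement itself, here is one possible sketch using `amatch()` from stringdist, which returns, for each value, the position of the closest entry in a lookup table within a maximum distance. It assumes you can supply a vector of correct spellings (the `correct` vector below is hypothetical) and that a maximum OSA distance of 2 is acceptable; values with no close match are left unchanged.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Hypothetical reference list of correct spellings
correct <- c("Pune", "Orissa")

# Index of the closest correct spelling within maxDist, or NA if none is close enough
idx <- amatch(Region, correct, method = "osa", maxDist = 2)

# Replace matched values; keep the original where no match was found
Region_clean <- ifelse(is.na(idx), Region, correct[idx])

Region_clean
#> [1] "Pune"   "Orissa" "Orissa" "Pune"
```

A cutoff that is too generous will merge names that are genuinely different, so it is worth reviewing the matched pairs before overwriting the column.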