I have a table with misspelling words. I need to correct those using from the words more similar to that one, the one that have more frequency.
For example, after I run
aggregate(CustomerID ~ Province, ventas2, length)
I get
1
2 AMBA 29
3 BAIRES 1
4 BENOS AIRES 1
12 BUENAS AIRES 1
17 BUENOS AIRES 4
18 buenos aires 7
19 Buenos Aires 3
20 BUENOS AIRES 11337
35 CORDOBA 2297
36 cordoba 1
38 CORDOBESA 1
39 CORRIENTES 424
So I need to replace buenos aires, Buenos Aires, Baires, BUENOS AIRES, with BUENOS AIRES but AMBA shouldn't be replaced. Also CORDOBESA and cordoba should be replaced by CORDOBA, but not CORRIENTES.
How can I do this in R?
Thanks!
Here's a possibile solution.
Disclaimer :
This code seems to works fine with your current example. I don't assure that the current parameters (e.g. cut height, cluster agglomeration method, distance method etc.) will be valid for your real (complete) data.
# recreating your data
data <-
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=F)
# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))
# string distance computation
# N.B. here you can play with single components of distance costs
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')
# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters)
# N.B. I decided to cut at distance height < 5
# (read it as: "I consider equal 2 strings needing
# less than 5 modifications to pass from one to the other")
# Obviously you can use another value.
rect.hclust(hc,h=4.9)
# get the clusters ids
clusters <- cutree(hc,h=4.9)
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)
# merge with frequencies
merged <- merge(data,clusters,all.x=T,by='City')
# add CityCorrected column to the merged data.frame
ret <- by(merged,
merged$ClusterId,
FUN=function(grp){
idx <- which.max(grp$Occur)
grp$CityCorrected <- grp[idx,'City']
return(grp)
})
fixed <- do.call(rbind,ret)
Result :
> fixed
City Occurr ClusterId CityCorrected
1 AMBA 29 1 AMBA
2.2 BAIRES 1 2 BUENOS AIRES
2.3 BENOS AIRES 1 2 BUENOS AIRES
2.4 BUENAS AIRES 1 2 BUENOS AIRES
2.5 BUENOS AIRES 4 2 BUENOS AIRES
2.6 buenos aires 7 2 BUENOS AIRES
2.7 Buenos Aires 3 2 BUENOS AIRES
2.8 BUENOS AIRES 11337 2 BUENOS AIRES
3.9 cordoba 1 3 CORDOBA
3.10 CORDOBA 2297 3 CORDOBA
3.11 CORDOBESA 1 3 CORDOBA
4 CORRIENTES 424 4 CORRIENTES
Cluster Plot :