I want to use Metaphone, Double Metaphone, Caverphone, Metaphone 3, Soundex and, if anyone has implemented it yet, NameX functions in R so that I can categorize and summarize like values and minimize data-cleansing operations prior to analysis.
I am fully aware that each algorithm has its own strengths and weaknesses, and I would strongly prefer not to use Soundex, although it might still work if I cannot find alternatives. As mentioned in this post, Harper would match any of a list of unrelated names under Soundex but should not under Metaphone, which gives better matching.
I am not sure which algorithm would serve my purposes best while still preserving some flexibility, which is why I want to try several of them and, before looking at the values, generate a table like the following.
Surnames are not the subject of my initial analysis, but I think they are a good example: what I am really trying to do is treat all like-sounding words as the same value, with a simple function call as the values are evaluated.
Some things I have already looked at:
So I am specifically looking for an answer on how to run a Metaphone / Caverphone function in R and get the resulting "Value" so that I can group data values by it.
The additional caveat is that I still consider myself pretty new to R, as I am not a daily user of it.
The algorithm is pretty straightforward, but I, too, could not find an existing R package. If you really need to do this work in R, one short-term option is to install the Python module metaphone (pip install metaphone) and then use the rPython bridge to call it from R:
library(rPython)
python.exec("from metaphone import doublemetaphone")
python.call("doublemetaphone", "architect")
[1] "ARKTKT" ""
It's not the most elegant solution, but it gets you metaphone operations in R.
Apache Commons has a codec library (Commons Codec) that also implements the Metaphone algorithms:
library(rJava)
.jinit() # need to have commons-codec-1.10.jar in your CLASSPATH
mp <- .jnew("org.apache.commons.codec.language.Metaphone")
.jcall(mp,"S","metaphone", "architect")
[1] "ARXT"
You can wrap the above .jcall in an R function and use it like any other R function:
metaphone <- function(x) {
  .jcall(mp, "S", "metaphone", x)
}
sapply(c("abridgement", "stupendous"), metaphone)
## abridgement  stupendous
##      "ABRJ"      "STPN"
The Java interface may be more portable across platforms, too.
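If the jar is not already on your CLASSPATH, rJava can add it at run time with .jaddClassPath(). The following is only a sketch: the helper name init_codec and the default jar location are assumptions for illustration, so point the path at wherever you actually saved the Commons Codec jar.

```r
# Hypothetical helper: add a local commons-codec jar to rJava's class path.
# The default jar location is an assumption; adjust it to your own setup.
init_codec <- function(jar = "commons-codec-1.10.jar") {
  if (!file.exists(jar)) {
    stop("jar not found: ", jar)
  }
  rJava::.jinit()             # start the JVM if it is not running yet
  rJava::.jaddClassPath(jar)  # make the codec classes visible to .jnew()
  invisible(normalizePath(jar))
}
```

Call init_codec() once per session before the first .jnew(); it stops with an error instead of starting the JVM when the jar cannot be found.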
Here's a more complete example of using the Java interface:
library(rJava)
.jinit()
mp <- .jnew("org.apache.commons.codec.language.Metaphone")
dmp <- .jnew("org.apache.commons.codec.language.DoubleMetaphone")
metaphone <- function(x) {
  .jcall(mp, "S", "metaphone", x)
}
double_metaphone <- function(x) {
  .jcall(dmp, "S", "doubleMetaphone", x)
}
words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan',
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith',
           'Smyth', 'Jessica', 'Joshua')
data.frame(metaphone = sapply(words, metaphone),
           double = sapply(words, double_metaphone))
##           metaphone double
## Catherine      K0RN   K0RN
## Katherine      K0RN   K0RN
## Katarina       KTRN   KTRN
## Johnathan      JN0N   JN0N
## Jonathan       JN0N   JN0N
## John             JN     JN
## Teresa          TRS    TRS
## Theresa         0RS    0RS
## Smith           SM0    SM0
## Smyth           SM0    SM0
## Jessica         JSK    JSK
## Joshua           JX     JX
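Since the goal is to group values by their code, here is a minimal base-R sketch of that last step. To keep it runnable without Java, the codes are copied by hand from the metaphone column above; in practice you would build the vector with sapply(words, metaphone). The variable names codes, groups and counts are just illustrative.

```r
# Metaphone codes copied from the output table above, keyed by word.
codes <- c(Catherine = "K0RN", Katherine = "K0RN", Katarina = "KTRN",
           Johnathan = "JN0N", Jonathan = "JN0N", John = "JN",
           Teresa = "TRS", Theresa = "0RS", Smith = "SM0",
           Smyth = "SM0", Jessica = "JSK", Joshua = "JX")

# One list element per code, holding every word that shares that code.
groups <- split(names(codes), codes)

# Number of words per code, for a quick summary.
counts <- table(codes)

groups[["K0RN"]]  # "Catherine" "Katherine"
```

Note that Teresa (TRS) and Theresa (0RS) end up in different groups here, so it is worth checking the groups against your own data before relying on any one algorithm.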