r soundex metaphone

Metaphone (SoundEx-like) functions and their use in R?


I want to use Metaphone, Double Metaphone, Caverphone, Metaphone 3, SoundEx and, if anyone has implemented it yet, NameX functions within R, so that I can categorize and summarize like values and minimize data cleansing before analysis.

I am fully aware that each algorithm has its own strengths and weaknesses, and I would much prefer not to use SoundEx, though it might still work if I cannot find alternatives. As mentioned in this post, Harper would match any of a list of unrelated names under SoundEx, but should not under Metaphone, which gives better matches.

I am not sure which would serve my purposes best while still preserving some flexibility, so I want to try several of them and, before looking at the values, generate a table like the following.


Table Source Link

Surnames are not the subject of my initial analysis, but I think they are a good example: what I am really trying to do is treat all like-sounding words as the same value, with a simple function call as the values are evaluated.

Some things I have already looked at:

So I am specifically looking for an answer on how to run a Metaphone / Caverphone function in R and get the resulting "value" so I can group data values by it.

An additional caveat: I still consider myself fairly new to R, as I am not a daily user.


Solution

  • The algorithm is pretty straightforward but I, too, could not find an existing R package. If you really need to do this work in R, one short-term option is to install the Python module metaphone (pip install metaphone) and then call it from R through the rPython bridge:

    library(rPython)
    
    python.exec("from metaphone import doublemetaphone")
    python.call("doublemetaphone", "architect")
    ## [1] "ARKTKT" ""
    

    It's not the most elegant solution, but it gets you metaphone operations in R.

    Apache Commons has a codec library (Commons Codec) that also implements the Metaphone algorithms:

    library(rJava)
    
    .jinit() # commons-codec-1.10.jar must be on your CLASSPATH
    
    mp <- .jnew("org.apache.commons.codec.language.Metaphone")
    # codes are truncated to 4 characters by default (see setMaxCodeLen)
    .jcall(mp,"S","metaphone", "architect")
    ## [1] "ARXT"
    

    You can wrap the above .jcall in an R function and use it like any other R function:

    metaphone <- function(x) {
      .jcall(mp,"S","metaphone", x)  
    }
    
    sapply(c("abridgement", "stupendous"), metaphone)
    
    ## abridgement  stupendous 
    ##      "ABRJ"      "STPN"
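    
    sapply changes its return type if an encoder ever returns something other than a single string, so a slightly stricter variant can use vapply instead. The following is only a sketch: vectorize_encoder and toy_encode are hypothetical names, and toy_encode is a stand-in so the example runs without Java. In practice you would pass the metaphone function defined above.

    ```r
    # Hypothetical helper: wraps any scalar string encoder so it accepts a vector.
    vectorize_encoder <- function(encode) {
      function(x) {
        # vapply enforces exactly one character string per input element
        vapply(x, encode, character(1), USE.NAMES = TRUE)
      }
    }

    # Stand-in encoder (assumption, NOT a phonetic algorithm): first letter, upper-cased.
    # Replace with `metaphone` once rJava and commons-codec are set up.
    toy_encode <- function(word) toupper(substr(word, 1, 1))

    encode_all <- vectorize_encoder(toy_encode)
    codes <- encode_all(c("abridgement", "stupendous"))
    ```

    The result is a named character vector, with the input words as names, just like the sapply output above.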
    

    The Java interface may be more compatible across platforms, too.

    Here's a more complete view of using the Java interface:

    library(rJava)
    
    .jinit()
    
    mp <- .jnew("org.apache.commons.codec.language.Metaphone")
    dmp <- .jnew("org.apache.commons.codec.language.DoubleMetaphone")
    
    metaphone <- function(x) {
      .jcall(mp,"S","metaphone", x)  
    }
    
    double_metaphone <- function(x) {
      .jcall(dmp,"S","doubleMetaphone", x)  
    }
    
    words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
               'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
               'Smyth', 'Jessica', 'Joshua')
    
    data.frame(metaphone=sapply(words, metaphone),
               double=sapply(words, double_metaphone))
    
    ##           metaphone double
    ## Catherine      K0RN   K0RN
    ## Katherine      K0RN   K0RN
    ## Katarina       KTRN   KTRN
    ## Johnathan      JN0N   JN0N
    ## Jonathan       JN0N   JN0N
    ## John             JN     JN
    ## Teresa          TRS    TRS
    ## Theresa         0RS    0RS
    ## Smith           SM0    SM0
    ## Smyth           SM0    SM0
    ## Jessica         JSK    JSK
    ## Joshua           JX     JX
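    
    To get from codes to the grouping the question asks for, one option is base R's split. This is a minimal sketch: the Metaphone codes are hard-coded from the table above so it runs without Java, and the variable names are only illustrative.

    ```r
    # Metaphone codes copied from the table above (hard-coded so no Java is needed)
    codes <- c(Catherine = "K0RN", Katherine = "K0RN", Katarina = "KTRN",
               Johnathan = "JN0N", Jonathan = "JN0N", John = "JN",
               Teresa = "TRS", Theresa = "0RS", Smith = "SM0",
               Smyth = "SM0", Jessica = "JSK", Joshua = "JX")

    # Group the original values by their phonetic code
    groups <- split(names(codes), codes)

    # Keep only the codes shared by more than one value
    shared <- groups[lengths(groups) > 1]
    ```

    Here shared holds the codes K0RN, JN0N and SM0, each with two spellings, while Teresa and Theresa stay in separate groups because their codes differ (TRS vs 0RS). With live data you would build codes from the metaphone function instead of typing them in.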