
Matching strings with abbreviations; fuzzy matching


I am having trouble matching character strings. Most of the difficulty centers on abbreviations.

I have two character vectors. I am trying to match each word in vector A (typos) to its closest match in vector B.

    vec.a <- c("ce", "amer", "principl")
    vec.b <- c("ceo", "american", "principal")

My first crack at this used amatch, the fuzzy matching function from the stringdist package. However, I can only push it so far.

    amatch(vec.a, vec.b, maxDist = 3)
    [1] 1 1 3

The fuzzy matching in amatch works wonderfully for typos: in this case, ce -> ceo and principl -> principal. The problem arises with abbreviations. amer should be matched with american, but ceo is a closer match because fewer edits are needed to get there (3 versus 4). How can I do fuzzy matching in the presence of abbreviations?


Solution

  • Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler distance works for the example provided in your question.

    library(stringdist)
    
    vec.a <- c("ce", "amer", "principl")
    vec.b <- c("ceo", "american", "principal")
    
    amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
    #> [1] 1 2 3
    amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
    #> [1] 1 2 3
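
To see why switching the measure helps, it can be useful to look at the full distance matrices rather than only the matched indices. The following is a minimal sketch, assuming the default amatch method is "osa" (optimal string alignment) and using stringdistmatrix from the same package. The names d_osa, d_jw, nearest_osa and nearest_jw are only illustrative.

```r
library(stringdist)

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

# Edit-based distance: counts insertions, deletions, substitutions and
# adjacent transpositions. Rows follow vec.a, columns follow vec.b.
d_osa <- stringdistmatrix(vec.a, vec.b, method = "osa")
dimnames(d_osa) <- list(vec.a, vec.b)

# Jaro distance (p = 0): based on the share of matching characters,
# scaled to [0, 1], so a long target is penalised far less for the
# characters an abbreviation leaves out.
d_jw <- stringdistmatrix(vec.a, vec.b, method = "jw", p = 0)
dimnames(d_jw) <- list(vec.a, vec.b)

# Column index of the nearest candidate in vec.b for each entry of vec.a
nearest_osa <- apply(d_osa, 1, which.min)
nearest_jw  <- apply(d_jw, 1, which.min)

print(d_osa)
print(round(d_jw, 3))
print(nearest_osa)
print(nearest_jw)
```

Under the edit distance, "amer" is 3 edits from "ceo" but 4 from "american", which is why it gets matched to the wrong word. Under the Jaro distance all four characters of "amer" count as matches against "american", so that pair comes out closest. This is only checked against the three example words, so other abbreviations may still need a different measure or a tuned maxDist.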