string, go, cosine-similarity, sentence-similarity, python-dedupe

String similarity for all possible combinations in an optimised fashion


I am facing a problem while computing string similarity.

Scenario: each string consists of the following fields: first_name, middle_name and last_name.

What I have to do is find the string similarity between A and B (both have the same fields), while making sure all possible arrangements of the fields are considered.

Case 1: say string A has first name: "Rahul", middle name: "Kumar", last name: " ".

And string B has first name: "Kumar", middle name: " ", last name: "Rahul".

At a glance we can say both names might be the same, but current similarity algorithms give only around 71% similarity.

Case 2:

Say string B has first name: "Rahul", middle name: " ", last name: "K.".

In this case the similarity percentage drops even further, but the names might still be the same.

How should I compute the similarity so that all possible combinations of first, middle and last name are considered, in an optimised fashion?

E.g. "Rahul Rakesh Kumar" can be written as "Rahul Kumar", "Rahul", "Rahul K. Kumar", "Kumar rakesh Rahul", "Rahul R. Kumar", etc.

I have tried the Jaccard, cosine and Jaro-Winkler similarity algorithms, but the results are not satisfactory.

Note: I have to deduplicate records based on names, which is why I have to consider all possible forms of the names.


Solution

  • For this kind of string similarity I recently implemented an algorithm that works great for my use case, which is very similar to yours. It is described in this blog post: http://www.catalysoft.com/articles/strikeamatch.html

    To compare two strings:

    1. Make a list of overlapping letter-pairs. "kumar" gives us pairs: "ku", "um", "ma", "ar".
    2. Say the second string is "maria". Pairs: "ma", "ar", "ri", "ia".
    3. Count how many pairs occur in both strings: here "ma" and "ar", so 2 shared pairs.
    4. The similarity is computed as: 2 * same_pairs / (pairs_1 + pairs_2) = 2*2/(4+4) = 0.5. (A short Go sketch of these steps follows this list.)
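
    To make those steps concrete, here is a minimal sketch in Go. The function names and the per-word pairing are my reading of the blog post rather than code from it, and it assumes ASCII input:

    ```go
    package main

    import (
        "fmt"
        "strings"
    )

    // letterPairs returns the overlapping letter pairs of every word in s,
    // e.g. "kumar" -> ["ku" "um" "ma" "ar"]. Pairing per word means word
    // order cannot create or destroy pairs.
    func letterPairs(s string) []string {
        var pairs []string
        for _, word := range strings.Fields(strings.ToLower(s)) {
            for i := 0; i+2 <= len(word); i++ { // assumes ASCII input
                pairs = append(pairs, word[i:i+2])
            }
        }
        return pairs
    }

    // similarity computes 2 * same_pairs / (pairs_1 + pairs_2).
    func similarity(a, b string) float64 {
        pairsA, pairsB := letterPairs(a), letterPairs(b)
        total := len(pairsA) + len(pairsB)
        if total == 0 {
            return 0
        }
        // Count the pairs of b, then consume them while scanning the pairs
        // of a, so a repeated pair only matches as often as it occurs.
        counts := make(map[string]int)
        for _, p := range pairsB {
            counts[p]++
        }
        shared := 0
        for _, p := range pairsA {
            if counts[p] > 0 {
                counts[p]--
                shared++
            }
        }
        return 2 * float64(shared) / float64(total)
    }

    func main() {
        fmt.Printf("%.2f\n", similarity("kumar", "maria"))             // 0.50
        fmt.Printf("%.2f\n", similarity("Rahul Kumar", "Kumar Rahul")) // 1.00
    }
    ```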
    

    The great thing about this algorithm is that it is sensitive to local ordering, because each letter pair is itself ordered, yet insensitive to global ordering, because the pairs can be matched in any order. This makes it perfect for my specific use case where, similar to your situation, I want to treat "abc def" as an exact match for "def abc".

    You can also modify the algorithm. Part of my use case is searching for a short string in a longer text. For this I use the measure same_pairs / pairs_1, i.e. the number of shared pairs over the pair count of the search term. If a word appears in the text, this gives me a similarity of 1.0, no matter how long the text is. The symmetric similarity described above would penalize longer texts in this case.
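
    As a sketch of that variant, continuing the code above (containment is my own name for it):

    ```go
    // containment computes same_pairs / pairs_1: the fraction of the search
    // term's pairs that also occur in the text. A term that appears verbatim
    // in the text scores 1.0, no matter how long the text is.
    func containment(term, text string) float64 {
        termPairs := letterPairs(term)
        if len(termPairs) == 0 {
            return 0
        }
        counts := make(map[string]int)
        for _, p := range letterPairs(text) {
            counts[p]++
        }
        shared := 0
        for _, p := range termPairs {
            if counts[p] > 0 {
                counts[p]--
                shared++
            }
        }
        return float64(shared) / float64(len(termPairs))
    }
    ```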

    Another note on the implementation: you can make it really efficient and robust. I first reduce my alphabet to lower-case letters, digits and space, i.e. 26 + 10 + 1 = 37 symbols, mapping A to a, ä to a, and so on. The number of possible letter pairs is then 37 * 37 = 1369. I count the pairs in an array pairCounts [37*37]byte, where the pair "aa" maps to index 0, "ab" to 1, and so on. So when I find the pair "ac" in a string, it maps to index 2 and I do pairCounts[2]++. Each pair can be counted up to 255 times this way (I use a byte per pair).

    Normalizing the letters makes the comparison case-insensitive, and using an array instead of a full-blown map makes it run really fast. A sketch of this array-based counting follows.
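
    Here is one way that counting scheme could look, continuing the sketch above. The symbol ordering (letters, then digits, then space), the saturation at 255 and the plain ToLower normalization (without the ä-to-a folding) are my own simplifications; note that because space is part of the alphabet, pairs now span word boundaries, which differs slightly from the per-word pairing above:

    ```go
    // symbolIndex maps a normalized byte to 0..36: 'a'..'z' -> 0..25,
    // '0'..'9' -> 26..35, ' ' -> 36. Anything else returns -1 and is skipped.
    func symbolIndex(c byte) int {
        switch {
        case c >= 'a' && c <= 'z':
            return int(c - 'a')
        case c >= '0' && c <= '9':
            return 26 + int(c-'0')
        case c == ' ':
            return 36
        }
        return -1
    }

    // pairCounts tallies the letter pairs of s in a fixed-size array.
    // The pair (i, j) maps to slot i*37 + j, so "aa" -> 0, "ab" -> 1 and
    // "ac" -> 2. The byte counters saturate at 255 instead of wrapping.
    func pairCounts(s string) [37 * 37]byte {
        var counts [37 * 37]byte
        s = strings.ToLower(s)
        prev := -1
        for i := 0; i < len(s); i++ {
            cur := symbolIndex(s[i])
            if prev >= 0 && cur >= 0 {
                if idx := prev*37 + cur; counts[idx] < 255 {
                    counts[idx]++
                }
            }
            prev = cur
        }
        return counts
    }

    // arraySimilarity computes 2 * same_pairs / (pairs_1 + pairs_2) from two
    // count arrays; the shared count of a pair is the minimum of its counts.
    func arraySimilarity(a, b [37 * 37]byte) float64 {
        shared, total := 0, 0
        for i := range a {
            total += int(a[i]) + int(b[i])
            if a[i] < b[i] {
                shared += int(a[i])
            } else {
                shared += int(b[i])
            }
        }
        if total == 0 {
            return 0
        }
        return 2 * float64(shared) / float64(total)
    }
    ```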