python similarity difflib sequencematcher

Similarity ratio from a list of excluded strings


When comparing the similarity of two strings, I want to exclude a list of substrings, for example ignoring 'Texas' and 'US'.

I tried to use the 'isjunk' argument of difflib's SequenceMatcher:

from difflib import SequenceMatcher

exclusion = ['Texas', 'US']
sr = SequenceMatcher(lambda x: x in exclusion, 'Apple, Texas, US', 'Orange, Texas, US', autojunk=True).ratio()

print(sr)

The similarity ratio comes out as high as 0.72, so evidently the unwanted strings are not being excluded.

What is the right way to do this?


Solution

  • I'm not familiar with the package, but out of curiosity I searched around and experimented with a few examples of my own. I found something interesting. It is not a solution to your problem, but rather an explanation of the results you received.

    As I found here:

    ratio() returns the similarity score (a float in [0, 1]) between the input strings. It sums the sizes of all matched sequences returned by the function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T, where M = matches, T = total number of elements in both sequences

    So let's take a look at an example:

    from difflib import SequenceMatcher

    exclusion = ['Texas', 'US']
    a = 'Apple, Texas, US'
    b = 'Orange, Texas, US'
    sr = SequenceMatcher(lambda x: x in exclusion, a, b, autojunk=True)
    matches = sr.get_matching_blocks()
    # M is the total number of matched characters across all blocks
    M = sum(match.size for match in matches)
    print(matches)
    # T is the combined length of both strings
    ratio = 2 * M / (len(a) + len(b))
    print(f'ratio calculated: {ratio}')
    print(sr.ratio())
    

    I got this:

    [Match(a=4, b=5, size=12), Match(a=16, b=17, size=0)]
    ratio calculated: 0.7272727272727273
    0.7272727272727273
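
    One likely reason the matched block above still covers ', Texas, US' is that isjunk is called on individual sequence elements, and for str inputs those elements are single characters, so `x in exclusion` never sees a whole word such as 'Texas'. A minimal sketch to check this (recording_isjunk and seen are illustrative names, not part of difflib):

```python
from difflib import SequenceMatcher

exclusion = ['Texas', 'US']
a = 'Apple, Texas, US'
b = 'Orange, Texas, US'

seen = []  # every argument that isjunk receives


def recording_isjunk(x):
    # Record the element, then apply the same test as the original lambda.
    seen.append(x)
    return x in exclusion


matcher = SequenceMatcher(recording_isjunk, a, b, autojunk=True)
matcher.ratio()

# The distinct elements isjunk was asked about: all single characters.
print(sorted(set(seen)))
# Whether any of them was ever treated as junk.
print(any(item in exclusion for item in seen))
```

    Since no single character equals 'Texas' or 'US', the second print shows False, meaning nothing was ever marked as junk.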
    

    For this next example, I would expect to get the same result:

    a = 'Apple, Texas, USTexasUS'
    b = 'Orange, Texas, US'
    

    I expected the extra TexasUS to be ignored, since both words are in the exclusion list, so that the ratio would remain the same. Let's see what we get:

    [Match(a=4, b=5, size=12), Match(a=23, b=17, size=0)]
    ratio calculated: 0.6
    0.6
    

    The ratio is lower than in the first example, which at first seems to make no sense. But if we look closely at the output, we see that the matched blocks have exactly the same size! So what is the difference? The lengths of the strings: they are counted with the excluded strings still included. Sticking to the naming convention from the link, T is bigger now:

    T2>T1 ----> ratio2<ratio1
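
    The arithmetic behind this can be checked by hand. The sketch below plugs the matched size reported in the two outputs (M = 12) into ratio = 2.0*M / T; the names a1, a2, T1 and T2 are only labels for the two examples:

```python
# Lengths are counted with the excluded words still present.
a1 = 'Apple, Texas, US'
a2 = 'Apple, Texas, USTexasUS'
b = 'Orange, Texas, US'

M = 12  # size of the single non-empty matching block in both outputs

T1 = len(a1) + len(b)  # 16 + 17 = 33
T2 = len(a2) + len(b)  # 23 + 17 = 40

print(2 * M / T1)  # 0.7272727272727273
print(2 * M / T2)  # 0.6
```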
    

    I suggest filtering the words out yourself before matching, like this:

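    from difflib import SequenceMatcher

    exclusion = ['Texas', 'US']
    a = 'Apple, Texas, USTexasUS'
    b = 'Orange, Texas, US'
    # Remove every occurrence of each excluded string before comparing
    for word2exclude in exclusion:
        a = a.replace(word2exclude, '')
        b = b.replace(word2exclude, '')
    sr = SequenceMatcher(None, a, b)
    print(sr.ratio())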
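
    Note that str.replace removes 'US' wherever it appears, even inside a longer word, and leaves the commas and spaces behind. If only whole words should be dropped, one possible variant is sketched below. It assumes that words are runs of \w characters and that matching is case-sensitive; filtered_ratio and strip_excluded are hypothetical helper names, not part of difflib:

```python
import re
from difflib import SequenceMatcher


def filtered_ratio(a, b, exclusion):
    """Similarity of a and b after dropping whole words found in exclusion."""
    excluded = set(exclusion)

    def strip_excluded(text):
        # Split into words, drop the excluded ones, rejoin with single spaces.
        words = re.findall(r'\w+', text)
        return ' '.join(w for w in words if w not in excluded)

    return SequenceMatcher(None, strip_excluded(a), strip_excluded(b)).ratio()


exclusion = ['Texas', 'US']
# Only 'Apple' and 'Orange' are left to compare here.
print(filtered_ratio('Apple, Texas, US', 'Orange, Texas, US', exclusion))
# Both strings reduce to 'Apple', so they compare as identical.
print(filtered_ratio('Apple, Texas, US', 'Apple, US', exclusion))
```

    With this whole-word rule, a fused token such as 'USTexasUS' is kept rather than removed, so which variant fits depends on how the input data looks.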
    

    I hope you find this useful, if not to solve your problem then at least to understand it (understanding an issue is the first step toward a solution!).