pythonstringdifflibsequencematcher

Fuzzy string matching using Difflib get_matching_blocks not detecting all substrings


I'm trying to find all occurrences of a word in paragraph and I want it to account for spelling mistakes as well. Code:

to_search="caterpillar"
search_here= "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
#search_here has the word caterpillar repeated but with spelling mistakes

s= SequenceMatcher(None, to_search, search_here).get_matching_blocks()
print(s)

#Output  : [Match(a=0, b=0, size=11), Match(a=3, b=69, size=0)] 
#Expected: [Match(a=0, b=0, size=11), Match(a=0, b=32, size=11), Match(a=0, b=81, size=11)]

Difflib get_matching_blocks only detects the first instance of "caterpillar" in the search_here string. I want it to give me output of all closely matching blocks i.e. it should identify "caterpillar","catterpillar" and "caterpilar"

How can I solve this problem?


Solution

  • You could calculate the edit distance of each word vs. to_search. Then, you can select all the words that have "low enough" edit distance (a score of 0 means exact match).

    Thanks to your question, I have discovered that there is a pip-install-able edit_distance Python module. Here are a couple examples I just tried out for the first time:

    >>> edit_distance.SequenceMatcher('fabulous', 'fibulous').ratio()
    0.875
    >>> edit_distance.SequenceMatcher('fabulous', 'wonderful').ratio()
    0.11764705882352941
    >>> edit_distance.SequenceMatcher('fabulous', 'fabulous').ratio()
    1.0
    >>> edit_distance.SequenceMatcher('fabulous', '').ratio()
    0.0
    >>> edit_distance.SequenceMatcher('caterpillar', 'caterpilar').ratio()
    0.9523809523809523
    

    So, it looks like the ratio method gives you a number between 0 and 1 (inclusive) where 1 is an exact match and 0 is... not even in the same league XD. So yeah, you could select words that have a ratio greater than 1 - epsilon, where epsilon is maybe 0.1 or thereabouts.