I'm trying to find all occurrences of a word in paragraph and I want it to account for spelling mistakes as well. Code:
to_search="caterpillar"
search_here= "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
#search_here has the word caterpillar repeated but with spelling mistakes
s= SequenceMatcher(None, to_search, search_here).get_matching_blocks()
print(s)
#Output : [Match(a=0, b=0, size=11), Match(a=3, b=69, size=0)]
#Expected: [Match(a=0, b=0, size=11), Match(a=0, b=32, size=11), Match(a=0, b=81, size=11)]
Difflib get_matching_blocks only detects the first instance of "caterpillar" in the search_here string. I want it to give me output of all closely matching blocks i.e. it should identify "caterpillar","catterpillar" and "caterpilar"
How can I solve this problem?
You could calculate the edit distance of each word vs. to_search. Then, you can select all the words that have "low enough" edit distance (a score of 0 means exact match).
Thanks to your question, I have discovered that there is a pip-install-able edit_distance Python module. Here are a couple examples I just tried out for the first time:
>>> edit_distance.SequenceMatcher('fabulous', 'fibulous').ratio()
0.875
>>> edit_distance.SequenceMatcher('fabulous', 'wonderful').ratio()
0.11764705882352941
>>> edit_distance.SequenceMatcher('fabulous', 'fabulous').ratio()
1.0
>>> edit_distance.SequenceMatcher('fabulous', '').ratio()
0.0
>>> edit_distance.SequenceMatcher('caterpillar', 'caterpilar').ratio()
0.9523809523809523
So, it looks like the ratio method gives you a number between 0 and 1 (inclusive) where 1 is an exact match and 0 is... not even in the same league XD. So yeah, you could select words that have a ratio greater than 1 - epsilon, where epsilon is maybe 0.1 or thereabouts.