Trying to understand fuzzy pattern matching with regex. What I want: I have a string, and I want to find identical or similar strings in other, perhaps larger strings. (Does one field in a database record occur, perhaps as a fuzzy substring, in any other field in that database record?)
Here's a sample. Comments indicate character positions.
import regex
to_search = "1990 /"
#            123456
#             ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
#            12345678901234567890123456789012
#                                       ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
result:
>>> m
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
>>> m.fuzzy_changes
([], [], [28, 29, 31])
No substitutions, no insertions, 3 deletions at positions 28, 29 and 31. The order "substitutions, insertions, deletions" matters; it's taken from here.
Question: how to interpret this, in normal human language? What it says (I think):
"You have a match on the substring 10/ in your search_in, if you delete positions 28, 29 and 31 in it."
I probably got that wrong. This is true, though:
"If you delete positions 5, 3 and 2, in that order, in to_search, you have an exact match at substring 10/ in search_in, yay!"
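That second reading can be checked with a few lines of plain Python. This is only a minimal sketch, assuming the positions are 1-based as in the comments above; `chars` and `remaining` are names made up for the illustration.

```python
to_search = "1990 /"

# Delete 1-based positions 5, 3 and 2, in that order. Going from the highest
# position down means an earlier deletion never shifts a later one.
chars = list(to_search)
for pos in (5, 3, 2):
    del chars[pos - 1]

remaining = "".join(chars)
print(remaining)  # prints: 10/
```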
Fortunately, I found a guru! So I did
>>> import orc
>>> m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
>>> m
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
>>> near_match = orc.NearMatch.from_regex(m, to_search)
>>> print(near_match)
10/
I
190/
I
1990/
I
1990 /
Hmm... so the order of fuzzy_counts is, in fact, something, something, insertions?
I'd appreciate it if anyone could shed some light on this.
You are close, but according to the docs you mentioned in the post, this is what is going on here.
import regex
to_search = "1990 /"
#            123456
#             ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
#            12345678901234567890123456789012
#                                       ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
m
output:
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
m.fuzzy_changes
output:
([], [], [28, 29, 31])
Let's break it down step by step:
You're searching for the sequence "1990 /", allowing fewer than 4 errors ({e<4}), within the longer text "V CAD-0000:0000[01] ISS 23/10/91".
For an exact match, the longer string would have had to be:
V CAD-0000:0000[01] ISS 23/1990 /91
However, a few changes were made to that string to get the actual string. The characters at positions 28, 29 and 31 (counting from 0), that is "9", "9" and " ", were deleted, which gives:
V CAD-0000:0000[01] ISS 23/10/91
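The breakdown above can be replayed without the regex module. The following is a minimal sketch, not the library's own logic: the span and the deletion positions are copied by hand from the match object shown earlier, and it assumes the deletions-only case, where the string that would have matched exactly is search_in with the matched span replaced by the whole pattern. The names `would_have_matched`, `deleted_chars` and `after_deletion` are made up for the illustration.

```python
to_search = "1990 /"
search_in = "V CAD-0000:0000[01] ISS 23/10/91"

# Copied from the match object above (m.span() and m.fuzzy_changes),
# not recomputed here.
span = (27, 30)
substitutions, insertions, deletions = [], [], [28, 29, 31]

# Assumption: with deletions only, the exactly-matching text is search_in
# with the matched span swapped for the full pattern.
would_have_matched = search_in[:span[0]] + to_search + search_in[span[1]:]

# The reported positions are 0-based indices into that hypothetical string.
deleted_chars = [would_have_matched[i] for i in deletions]

# Dropping those characters should give back the real search_in.
after_deletion = "".join(
    ch for i, ch in enumerate(would_have_matched) if i not in deletions
)

print(would_have_matched)           # V CAD-0000:0000[01] ISS 23/1990 /91
print(deleted_chars)                # ['9', '9', ' ']
print(after_deletion == search_in)  # True
```

So the positions in fuzzy_changes refer to where the missing characters would have been, not to characters in the matched substring 10/ itself.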