python regex fuzzy-search

Python fuzzy string search with `regex`


I'm trying to understand fuzzy pattern matching with the `regex` module. What I want: given a string, find identical or similar strings inside other, perhaps larger, strings. (For example: does one field in a database record occur, perhaps as a fuzzy substring, in any other field of that record?)

Here's a sample. Comments indicate character positions.

import regex
to_search = "1990 /"
            #123456
            # ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
            #12345678901234567890123456789012
            #                           ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)

result:

>>> m
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
>>> m.fuzzy_changes
([], [], [28, 29, 31])    

No substitutions, no insertions, 3 deletions at positions 28, 29 and 31. The order "substitutions, insertions, deletions" matters; it's taken from the `regex` docs.

Question: how should I interpret this in plain human language? What it says (I think):

"You have a match from substring 10/ in your search_in, if you delete positions 28, 29 and 31 in it."

I probably got that wrong. This is true, though:

"If you delete positions 5, 3 and 2, in that order, in to_search, you have an exact match at substring 10/ in search_in, yay!"

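To double-check that second reading, here is a minimal sketch that applies the three deletions to to_search one at a time. It assumes 1-based positions (matching the #123456 ruler comment in the sample), and the helper name delete_at is made up for illustration; it is not part of `regex`.

```python
to_search = "1990 /"

def delete_at(text, position):
    """Return text with the character at the given 1-based position removed."""
    index = position - 1
    return text[:index] + text[index + 1:]

# Delete the highest position first so the lower positions stay valid.
steps = [to_search]
for position in (5, 3, 2):
    steps.append(delete_at(steps[-1], position))

print(steps)  # ['1990 /', '1990/', '190/', '10/']
```

The last element is the substring that `regex` reported as the match.
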
Fortunately, I found a guru! So I did

>>> import orc
>>> m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
>>> m
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
>>> near_match = orc.NearMatch.from_regex(m, to_search)
>>> print(near_match)
10/
 I
190/
  I
1990/
    I
1990 /

Hmm... so the order of fuzzy_counts is, in fact, something, something, insertions?

I'd appreciate it if anyone could shed some light on this.


Solution

  • You are close. According to the docs you mentioned in the post, this is what is going on here.

    import regex
    to_search = "1990 /"
                #123456
                # ^^ ^
    search_in = "V CAD-0000:0000[01] ISS 23/10/91"
                #12345678901234567890123456789012
                #                           ^^ ^
    m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
    m
    

    output:

    <regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
    
    
    m.fuzzy_changes
    

    output:

    ([], [], [28, 29, 31])
    
    

    EXPLANATION

    Let's break it down step by step:

    The Context:

    You're searching for the sequence "1990 /", allowing fewer than 4 errors ({e<4}), within a longer text "V CAD-0000:0000[01] ISS 23/10/91".

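    The Findings:

    The best match is the substring 10/ at span (27, 30), with fuzzy_counts=(0, 0, 3): no substitutions, no insertions, and 3 deletions at positions 28, 29 and 31.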

    The Analysis:

    For an exact match, the longer string would have had to be this:

    V CAD-0000:0000[01] ISS 23/1990 /91

    However, there were a few changes made to that string to get the actual string.

    Changes:

    1. Deletions:
      • Locations: Positions 28, 29, and 31 (counting from 0) in the presumed original sequence V CAD-0000:0000[01] ISS 23/1990 /91 were deleted. These are the two 9s and the space.
      • Resulting String: After these deletions, the presumed original sequence became the actual sequence V CAD-0000:0000[01] ISS 23/10/91.
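
The deletions described above can be checked with a short sketch in plain Python. It hard-codes the positions from the m.fuzzy_changes output shown earlier and reads them as 0-based indices into the presumed original string; the helper name apply_deletions is made up for illustration and is not part of `regex`.

```python
presumed = "V CAD-0000:0000[01] ISS 23/1990 /91"
actual = "V CAD-0000:0000[01] ISS 23/10/91"

# Third element of m.fuzzy_changes (the deletions), read as 0-based indices.
deleted_positions = [28, 29, 31]

def apply_deletions(text, positions):
    """Return text without the characters at the given 0-based positions."""
    drop = set(positions)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

deleted_chars = [presumed[i] for i in deleted_positions]
result = apply_deletions(presumed, deleted_positions)

print(deleted_chars)     # ['9', '9', ' ']
print(result == actual)  # True
print(result[27:30])     # 10/
```

So the deletions are counted against the pattern side: the characters 9, 9 and the space from "1990 /" are the ones missing from the text, which leaves 10/ at span (27, 30).
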