Failing to match number ranges with a pattern declared in a DEFINE block using the PyPI regex package


I'm using https://github.com/mrabarnett/mrab-regex (via pip install regex), but I'm getting an unexpected result here:

import regex

pattern_string =  r'''
        (?&N)
        ^ \W*? ENTRY              \W* (?P<entries>    (?&Range)    )     (?&N)

        (?(DEFINE)
             (?P<Decimal>
                 [ ]*? \d+ (?:[.,] \d+)? [ ]*?
             )
             (?P<Range>
                 (?&Decimal) - (?&Decimal) | (?&Decimal)
                 #(?&d) (?: - (?&d))?
             )
             (?P<N>
                 [\s\S]*?
             )
        )
    '''

flags = regex.MULTILINE | regex.VERBOSE  #| regex.DOTALL  | regex.V1 #| regex.IGNORECASE | regex.UNICODE

pattern = regex.compile(pattern_string, flags=flags)

bk2 = '''
ENTRY: 0.0975 - 0.101
'''.strip()
match = pattern.match(bk2)
print(match.groupdict())

gives:

{'entries': '0.0975', 'Decimal': None, 'Range': None, 'N': None}

It misses the second value: entries should be '0.0975 - 0.101'.

> pip show regex
Name: regex
Version: 2022.1.18
Summary: Alternative regular expression module, to replace re.
Home-page: https://github.com/mrabarnett/mrab-regex
Author: Matthew Barnett
Author-email: regex@mrabarnett.plus.com
License: Apache Software License
Location: ...
Requires:
Required-by:

> python --version
Python 3.10.0

Solution

  • The problem is that the spaces you allow in the Decimal group pattern get consumed, and a group called from the DEFINE block is matched atomically. The trailing [ ]*? is lazy and first matches zero spaces, but once the (?&Decimal) call has returned, the engine cannot backtrack into it to pick up the space before the hyphen. So the Decimal - Decimal alternative fails and the lone Decimal alternative matches instead. You can check this by putting the Decimal pattern into an atomic group and comparing two patterns.

    This one exhibits the same behavior as your regex with the DEFINE block:

        (?mx)^\W*?ENTRY\W*(?P<entries>(?>[ ]*? \d+ (?:[.,] \d+)? [ ]*?) - (?>[ ]*? \d+ (?:[.,] \d+)? [ ]*?) | (?>[ ]*? \d+ (?:[.,] \d+)? [ ]*?))

    while this one, without the atomic groups, finds the match correctly:

        (?mx)^\W*?ENTRY\W*(?P<entries>[ ]*? \d+ (?:[.,] \d+)? [ ]*? - [ ]*? \d+ (?:[.,] \d+)? [ ]*? | [ ]*? \d+ (?:[.,] \d+)? [ ]*?)
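The atomic-group comparison can also be reproduced in a few lines of Python. This is a minimal sketch, not part of the original demos: the helper name `entries` and the `DECIMAL` constant are made up for illustration, and the atomic group `(?>...)` only stands in for the `(?&Decimal)` call.

```python
# Atomic groups work in the third-party regex module and, from Python 3.11,
# in the standard-library re module, so either import is fine for this check.
try:
    import regex as rx
except ImportError:
    import re as rx

# The Decimal sub-pattern from the question, including its lazy space runs.
DECIMAL = r'[ ]*? \d+ (?:[.,] \d+)? [ ]*?'

def entries(decimal):
    """Build the ENTRY pattern around the given Decimal sub-pattern
    and return the text captured by the entries group."""
    pattern = rx.compile(
        rf'^\W*? ENTRY \W* (?P<entries> {decimal} - {decimal} | {decimal} )',
        flags=rx.MULTILINE | rx.VERBOSE,
    )
    return pattern.match('ENTRY: 0.0975 - 0.101').group('entries')

atomic = entries(f'(?>{DECIMAL})')  # no backtracking into the group once it has matched
plain = entries(f'(?:{DECIMAL})')   # same sub-pattern, backtracking allowed

print(repr(atomic))  # '0.0975'
print(repr(plain))   # '0.0975 - 0.101'
```

In the atomic variant the trailing `[ ]*?` settles on zero spaces, the hyphen then fails against the space, and the engine falls through to the single-Decimal alternative.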

    The easiest fix is to move the optional space patterns into the Range group pattern.

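    There are other minor enhancements you might want to introduce here. The pattern below drops the N group (with search, the leading and trailing (?&N) calls are not needed), makes the leading \W* greedy, and folds the two Range alternatives into one Decimal followed by an optional second one.

    So, the regex can look like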

    ^ \W* ENTRY              \W* (?P<entries>    (?&Range)    ) 
    (?(DEFINE)
        (?P<Decimal>
            \d+ (?:[.,] \d+)?
        )
        (?P<Range>
            (?&Decimal)(?:\ *-\ *(?&Decimal))?
        )
    )
    


    See the Python demo:

    import regex
    pattern_string =  r'''
            ^ \W* ENTRY              \W* (?P<entries>    (?&Range)    )
    
            (?(DEFINE)
                 (?P<Decimal>
                     \d+ (?:[.,] \d+)?
                 )
                 (?P<Range>
                     (?&Decimal)(?:\ *-\ *(?&Decimal))?
                 )
            )
        '''
    
    flags = regex.MULTILINE | regex.VERBOSE
    pattern = regex.compile(pattern_string, flags=flags)
    
    bk2 = '''
    ENTRY: 0.0975 - 0.101
    '''.strip()
    
    match = pattern.search(bk2)
    
    print(match.groupdict())
    

    Output:

    {'entries': '0.0975 - 0.101', 'Decimal': None, 'Range': None}
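If the goal is to get the range endpoints as numbers, the fixed pattern can be wrapped in a small helper. The sketch below assumes the third-party regex package is installed; `parse_entry` is a hypothetical name, and treating a single value as a range with equal endpoints is an assumption made for illustration, not something the question specifies.

```python
import regex

# Same pattern as in the answer: spaces around the hyphen live in Range,
# so the atomically matched Decimal calls never consume them.
PATTERN = regex.compile(r'''
    ^ \W* ENTRY \W* (?P<entries> (?&Range) )
    (?(DEFINE)
        (?P<Decimal> \d+ (?:[.,] \d+)? )
        (?P<Range>   (?&Decimal)(?:\ *-\ *(?&Decimal))? )
    )
''', flags=regex.MULTILINE | regex.VERBOSE)

def parse_entry(text):
    """Return (low, high) as floats for an ENTRY line, or None if nothing matches.

    Hypothetical helper: a lone value is returned as (value, value).
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    # Decimal never contains '-', so splitting on it separates the endpoints.
    parts = [float(p.strip().replace(',', '.'))
             for p in match.group('entries').split('-')]
    return (parts[0], parts[-1])

print(parse_entry('ENTRY: 0.0975 - 0.101'))  # (0.0975, 0.101)
print(parse_entry('ENTRY: 1,5-2,5'))         # (1.5, 2.5)
print(parse_entry('ENTRY: 5'))               # (5.0, 5.0)
```

The comma is normalised to a dot before conversion because the Decimal pattern accepts either separator.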