regex python-re

Unexpected behaviour of the regex "{m,n}?$"


Consider the following example:

>>> import sys, re
>>> sys.version
'3.11.5 (main, Sep 11 2023, 13:23:44) [GCC 11.2.0]'
>>> re.__version__
'2.2.1'
>>> re.findall('a{1,4}', 'aaaaa')
['aaaa', 'a']
>>> re.findall('a{1,4}?', 'aaaaa')
['a', 'a', 'a', 'a', 'a']
>>> re.findall('a{1,4}?$', 'aaaaa')
['aaaa']

I expected to see a single 'a' in the last result, but instead I got 'aaaa'. How can this behaviour be explained?


Solution

  • A regex search returns the leftmost possible match. The engine first tries start index 0, where it attempts a$, then aa$, aaa$ and aaaa$, in that order (the ? suffix means the shortest repetition is tried first). All of them fail, because $ cannot match before the end of the input. It then moves on to start index 1 and goes through the same sequence. There aaaa$ succeeds, so that match is returned and the later start indices are never tried. The ? affects the length of the match for a given start index, not the start index of the match.

    If you want to match the smallest number of "a" characters that form a suffix of the input, just test for a single "a" with a$.
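
The search order described above can be checked with a small sketch. The helper name match_at_each_start is hypothetical and only for illustration; it relies on the fact that a compiled pattern's match(text, pos) tries exactly one start position, whereas search() scans start positions from left to right and stops at the first success.

```python
import re

pattern = re.compile(r"a{1,4}?$")
text = "aaaaa"

def match_at_each_start(pattern, text):
    """Hypothetical helper: what the pattern matches when forced to start
    at each index in turn (None where no match is possible)."""
    results = []
    for start in range(len(text)):
        m = pattern.match(text, start)  # match() only tries this one start index
        results.append(m.group() if m else None)
    return results

per_start = match_at_each_start(pattern, text)
print(per_start)  # [None, 'aaaa', 'aaa', 'aa', 'a']

# search() (and findall()) take the first start index that succeeds: index 1.
first = pattern.search(text)
print(first.start(), first.group())  # 1 aaaa

# To get only the final "a", anchor a single "a" at the end instead.
print(re.findall(r"a$", text))  # ['a']
```

At each start index the lazy quantifier does pick the shortest match that still reaches the end of the string; the single 'a' at index 4 is simply never reached, because index 1 already succeeds.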