Consider the following example:
>>> import sys, re
>>> sys.version
'3.11.5 (main, Sep 11 2023, 13:23:44) [GCC 11.2.0]'
>>> re.__version__
'2.2.1'
>>> re.findall('a{1,4}', 'aaaaa')
['aaaa', 'a']
>>> re.findall('a{1,4}?', 'aaaaa')
['a', 'a', 'a', 'a', 'a']
>>> re.findall('a{1,4}?$', 'aaaaa')
['aaaa']
>>>
I expected to see a single 'a' in the last result, but instead I got 'aaaa'. How is this behaviour explained?
A regex returns the leftmost possible match, i.e. the one with the earliest start index. The engine first tries index 0, where it attempts a$, then aa$, aaa$ and aaaa$ (the lazy ? suffix dictates this shortest-first order), but all of them fail. It then moves to index 1 and goes through the same sequence; there, aaaa$ matches. Because a match has been found, no later start index is ever tried. The ? affects the length of the match for a given start index, not the start index of the match.
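The scan over start indices can be made visible with a small sketch. It assumes nothing beyond the standard re module; the helper name matches_by_start is made up for illustration. Pattern.match(text, pos) anchors the attempt at pos, so it shows what the engine would find if it were forced to start at each index:

```python
import re

def matches_by_start(pattern, text):
    """Return what `pattern` matches when anchored at each start index (hypothetical helper)."""
    compiled = re.compile(pattern)
    results = []
    for start in range(len(text)):
        # Pattern.match(text, pos) only tries a match beginning exactly at pos.
        m = compiled.match(text, start)
        results.append(m.group() if m else None)
    return results

per_start = matches_by_start(r'a{1,4}?$', 'aaaaa')
print(per_start)  # [None, 'aaaa', 'aaa', 'aa', 'a']

# re.search stops at the first start index that succeeds, which is index 1.
first = re.search(r'a{1,4}?$', 'aaaaa')
print(first.span())  # (1, 5)
```

A shorter match does exist at index 4, but the search never gets that far because index 1 already succeeds.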
If you want to match the smallest number of "a" characters that form a suffix of the input, simply test for a single "a" with a$.
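As a quick check of that suggestion, using the same input as in the question (the variable names are only illustrative):

```python
import re

text = 'aaaaa'

# A single "a" followed by end-of-string can only match the last character.
shortest_suffix = re.findall(r'a$', text)
print(shortest_suffix)  # ['a']

# The match starts at the final index rather than at index 1.
suffix_span = re.search(r'a$', text).span()
print(suffix_span)  # (4, 5)
```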