I have this text:
start start word word word end
and this regex, with the flags gm:
start.*?end
In my understanding, the lazy quantifier *? should limit the match to only:
start word word word end
However, it still matches the whole string. But if the string is:
start word word word end end
then the match stops at the first end, as I expected. Why is that?
From https://www.regular-expressions.info/engine.html:
This is a very important point to understand: a regex engine always returns the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, does the engine continue with the second character in the text. Again, it tries all possible permutations of the regex, in exactly the same order. The result is that the regex engine returns the leftmost match.
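To make the quoted rule concrete, here is a small sketch in Python, whose re module is also a backtracking engine. It uses the strings from the question. The variable names are just for illustration, and the last pattern is one common workaround (a lookahead that forbids crossing another start), not something the quoted article prescribes.

```python
import re

text = "start start word word word end"
lazy = re.compile(r"start.*?end")

# The engine tries index 0 first. The lazy quantifier only picks the shortest
# match from that fixed starting point; it never moves the starting point.
first = lazy.search(text)
print(first.span(), repr(first.group()))

# Starting the search at index 1 skips the first "start", so the next
# candidate starting point is the second "start" at index 6.
second = lazy.search(text, 1)
print(second.span(), repr(second.group()))

# With two "end"s, laziness is visible: the match stops at the first "end".
two_ends = lazy.search("start word word word end end")
print(two_ends.span(), repr(two_ends.group()))

# Possible workaround (an assumption about what the asker wants): do not let
# the lazy part run across another "start".
tempered = re.compile(r"start(?:(?!start).)*?end")
fixed = tempered.search(text)
print(fixed.span(), repr(fixed.group()))
```

With the tempered pattern, the attempt at index 0 fails because the lookahead refuses to step over the second start, so the engine moves on and the first successful match begins at the second start.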
There is a related "leftmost longest" rule, which not all regex engines use.
For example, the POSIX Regular Expressions definition, under "matched", says:
The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where first is defined to mean "begins earliest in the string". If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence will be matched. For example: the BRE bb* matches the second to fourth characters of abbbc, and the ERE (wee|week)(knights|night) matches all ten characters of weeknights.
$ echo hello | grep -Eo 'hel|hell|hello'
hello
$ echo hello | grep -Eo 'hell|hello|hel'
hello
In contrast, Perl-style engines (here grep -P, which uses PCRE) try the alternatives from left to right and choose the first one that matches:
$ echo hello | grep -Po 'hel|hell|hello'
hel
$ echo hello | grep -Po 'hell|hello|hel'
hell
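The same ordered-alternation behaviour can be seen in Python's re module, which is also a backtracking engine. The sketch below repeats the two grep -P examples and then shows one way to approximate the POSIX result for plain literal alternatives by listing the longest ones first. The helper longest_first is hypothetical, and it is not a general leftmost-longest implementation: it only works when every alternative is a literal string.

```python
import re

text = "hello"

# Alternatives are tried left to right; the first one that succeeds wins.
perl_like_1 = re.search(r"hel|hell|hello", text).group()
perl_like_2 = re.search(r"hell|hello|hel", text).group()
print(perl_like_1)
print(perl_like_2)


def longest_first(alternatives):
    """Hypothetical helper: join literal alternatives, longest first."""
    ordered = sorted(alternatives, key=len, reverse=True)
    return "|".join(re.escape(alt) for alt in ordered)


# Ordering by length makes the first successful alternative also the longest.
pattern = longest_first(["hel", "hell", "hello"])
posix_like = re.search(pattern, text).group()
print(pattern)
print(posix_like)
```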