regexregex-greedynon-greedy

How do greedy / lazy (non-greedy) / possessive quantifiers work internally?


I noticed that there are 3 different classes of quantifiers: greedy, lazy (i.e. non-greedy) and possessive.

I know that, loosely speaking, greedy quantifiers try to get the longest match by first reading in the entire input string and then truncate the characters one by one if the attempts keep failing; lazy quantifiers try to get the shortest match by first reading in the empty string and then add in the characters one by one if the attempts keep failing; possessive quantifiers try the same way as greedy quantifiers while they will stop matching if the first attempt fails.

However, I'm not sure how exactly the aboves are being implemented 'internally', and would like to ask for clarification (hopefully with examples).


For example, say we have the input string as "fooaaafoooobbbfoo".

If the regex is "foo.*" (greedy), will the foo in the regex first match the foo in the input string, and then .* reads in aaafoooobbbfoo as 'the entire string'? Or will .* first read in fooaaafoooobbbfoo as 'the entire string', and then truncates fooaaafoooobbbfoo to try matching the foo in the regex? If it is the latter, will fooaaafoooobbbfoo be truncated from its left or from its right in each attempt?

Will the answers to the above questions change if I replace "foo.*" with ".*foo" or "foo.*foo" as my regex? What about if I change those greedy quantifiers to lazy ones and possessive ones?

And if there are more than one quantifiers in a regex, how will the engine deal with the priority (if that matters)?


Thanks in advance!


Solution

  • For your input string fooaaafoooobbbfoo.

    Case 1: When you're using this regex:

    foo.*
    

    First remember this fact that engine traverses from left to right.

    With that in mind above regex will match first foo which is at the start of input and then .* will greedily match longest possible match which is rest of the text after foo till end. At this point matching stops as there is nothing to match after .* in your pattern.

    Case 2: When you're using this regex:

    .*foo
    

    Here again .* will greedily match longest possible match before matching last foo which is right the end of input.

    Case 3: When you're using this regex:

    foo.*foo
    

    Which will match first foo found in input i.e. foo at the start then .* will greedily match longest possible match before matching last foo which is right the end of input.

    Case 4: When you're using this regex with lazy quantifier:

    foo.*?foo
    

    Which will match first foo found in input i.e. foo at the start then .*? will lazily match shortest possible match before matching next foo which is second instance of foo starting at position 6 in input.

    Case 5: When you're using this regex with possessive quantifier:

    foo.*+foo
    

    Which will match first foo found in input i.e. foo at the start then .*+ is using possessive quantifier which means match as many times as possible, without giving back. This will match greedily longest possible match till end and since possessive quantifier doesn't allow engine to backtrack hence presence of foo at the end of part will cause failure as engine will fail to match last foo.