Tags: php, regex, regex-lookarounds, reluctant-quantifiers

RegExp exercise: reluctant quantifier with a lookahead assertion


Can you explain to me how this works? Here is an example:

<!-- The quick brown fox 
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" />
<![endif]-->

<!-- Pack my box with five dozen liquor jugs -->

First, I tried to use the following regular expression to match the content inside conditional comments:

/<!--.*?stylesheet.*?-->/s

It failed: the match starts at the first <!-- and runs through the --> that closes the conditional comment, so the preceding comment is swallowed as well. Then I tried another pattern with a lookahead assertion:

/<!--(?=.*?stylesheet).*?-->/s

It works and matches exactly what I need. However, the following regular expression works as well:

/<!--(?=.*stylesheet).*?-->/s

The last regular expression does not have a reluctant quantifier in the lookahead assertion, and now I am confused. Can anyone explain to me how it works? Is there perhaps a better solution for this example?
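A quick way to see what each of the three patterns above actually captures is to run them over the sample with preg_match_all. This is only a sketch: the array labels and variable names are made up for the illustration, and the sample is rebuilt line by line from the example above.

```php
<?php
// Sample document from the question, rebuilt line by line.
$html = implode("\n", [
    '<!-- The quick brown fox',
    '              jumps over the lazy dog -->',
    '',
    '<!--[if IE 7]>',
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />',
    '<![endif]-->',
    '',
    '<!-- Pack my box with five dozen liquor jugs -->',
]);

// The three patterns from the question; the labels are only for this sketch.
$patterns = [
    'no lookahead'     => '/<!--.*?stylesheet.*?-->/s',
    'lazy lookahead'   => '/<!--(?=.*?stylesheet).*?-->/s',
    'greedy lookahead' => '/<!--(?=.*stylesheet).*?-->/s',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    preg_match_all($pattern, $html, $m);
    $results[$label] = $m[0]; // full matches only
    printf("%-16s -> %d match(es)\n", $label, count($m[0]));
    foreach ($m[0] as $match) {
        // Print only the first line of each match to keep the output short.
        printf("    starts with: %s\n", strtok($match, "\n"));
    }
}
```

On this sample both lookahead variants return the same two matches, and the first of them is the plain comment at the top. The lookahead only asserts that stylesheet occurs somewhere further on; it consumes nothing and puts no limit on where the following .*?--> stops. That is why a lazy or greedy quantifier inside it makes no difference to the result.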

Updated:

I tried using the regular expressions with the lookahead assertion on another document, and they failed to match the content between the comments. So /<!--(?=.*?stylesheet).*?-->/s (as well as /<!--(?=.*stylesheet).*?-->/s) is not correct. Do not use either of them; try the other suggestions instead.

Updated:

The solution has been found by Jonny 5 (see the answer). He suggested three options:

  1. Using a negated hyphen to limit the match. This option works only if there is no hyphen between the tags. If a stylesheet has a URL such as /style-sheet.css, it will not work.
  2. Using the escape sequence \K. It works like a charm. The downsides are the following:
    • It is terribly slow (in my case, it was 8-10 times slower than the other solutions)
    • It is only available since PHP 5.2.4
  3. Using a lookahead to narrow the match. This is what I was trying to achieve, but my experience with lookaround assertions was insufficient for the task.

I think the following is a good solution for my example:

/(?s)<!--(?:(?!<!).)+?stylesheet.+?-->/

The same but with the s modifier at the end:

/<!--(?:(?!<!).)+?stylesheet.+?-->/s

As I said, this is a good solution, but I managed to improve the pattern and found another one that in my case works faster.

So, the final solution is the following:

/<!--(?:(?!-->).)+?stylesheet.+?-->/s
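As a rough check, the final pattern can be run against the sample from the top of the question. The second input, $tricky, is a hypothetical string made up for this sketch; it shows one case where stopping at --> and stopping at <! give different answers.

```php
<?php
// Sample document from the question, rebuilt line by line.
$html = implode("\n", [
    '<!-- The quick brown fox',
    '              jumps over the lazy dog -->',
    '',
    '<!--[if IE 7]>',
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />',
    '<![endif]-->',
    '',
    '<!-- Pack my box with five dozen liquor jugs -->',
]);

$final = '/<!--(?:(?!-->).)+?stylesheet.+?-->/s';

// Only the conditional comment is expected to match.
$found = preg_match($final, $html, $m);
echo $m[0], "\n";

// Hypothetical input: "stylesheet" appears outside of any comment.
$tricky = '<!-- note --> <link rel="stylesheet" href="/a.css" /> <!-- end -->';

// (?!-->) refuses to walk past the end of the first comment: no match.
$strict = preg_match($final, $tricky);

// (?!<!) happily walks past --> and matches across both comments.
$loose = preg_match('/<!--(?:(?!<!).)+?stylesheet.+?-->/s', $tricky);

printf("strict: %d, loose: %d\n", $strict, $loose);
```

The speed difference mentioned above is not measured here; the sketch only compares what the two variants match.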

Thanks to all the participants for the interesting answers.


Solution

  • There are several ways to match only the part <!--...stylesheet...-->:

    1.) Use a negated hyphen [^-] to limit the match and stay in between <!-- and stylesheet

    (?s)<!--[^-]+stylesheet.+?-->
    

    [^-] allows only characters that are not a hyphen. See test at regex101.
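A minimal sketch of the negated hyphen pattern in PHP, run on the sample from the question. The second input, $hyphenated, is a hypothetical variant invented here to show the limitation: a hyphen between <!-- and stylesheet stops [^-]+ too early.

```php
<?php
$pattern = '/(?s)<!--[^-]+stylesheet.+?-->/';

// Sample document from the question, rebuilt line by line.
$html = implode("\n", [
    '<!-- The quick brown fox',
    '              jumps over the lazy dog -->',
    '',
    '<!--[if IE 7]>',
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />',
    '<![endif]-->',
    '',
    '<!-- Pack my box with five dozen liquor jugs -->',
]);

// [^-]+ cannot run past the --> of a plain comment, so only the
// conditional comment is found.
$hits = preg_match_all($pattern, $html, $m);
echo $m[0][0], "\n";

// Hypothetical variant: a hyphen appears before the word "stylesheet".
$hyphenated = '<!--[if IE 7]><link href="/style-sheet.css" rel="stylesheet" /><![endif]-->';

// [^-]+ stops at "/style" and never reaches rel="stylesheet": no match.
$misses = preg_match($pattern, $hyphenated);

printf("sample: %d match(es), hyphenated: %d match(es)\n", $hits, $misses);
```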


    2.) To get the "last" (closest) match with little regex effort, you can also put a greedy .* in front to eat up everything before it. This makes sense when not matching globally, i.e. when there is only one item to match. Use \K to reset the start of the reported match after the greedy part:

    (?s)^.*\K<!--.+?stylesheet.+?-->
    

    See test at regex101. You can also use a capture group and grab $1: (?s)^.*(<!--.+?stylesheet.+?-->)
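Both forms can be compared side by side with preg_match. This is a sketch on the sample from the question; the variable names $viaK and $viaGroup are made up for the illustration.

```php
<?php
// Sample document from the question, rebuilt line by line.
$html = implode("\n", [
    '<!-- The quick brown fox',
    '              jumps over the lazy dog -->',
    '',
    '<!--[if IE 7]>',
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />',
    '<![endif]-->',
    '',
    '<!-- Pack my box with five dozen liquor jugs -->',
]);

// Greedy .* runs to the end and backtracks to the last <!-- from which
// the rest of the pattern can still match; \K then drops everything
// consumed so far from the reported match.
preg_match('/(?s)^.*\K<!--.+?stylesheet.+?-->/', $html, $viaK);

// Same idea without \K: the wanted part is in group 1, while group 0
// still contains everything from the start of the input.
preg_match('/(?s)^.*(<!--.+?stylesheet.+?-->)/', $html, $viaGroup);

echo $viaK[0], "\n";
var_dump($viaK[0] === $viaGroup[1]);
```

Because .* first consumes the whole input and then backtracks, the cost grows with the size of the document, which fits the slowdown reported in the question.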


    3.) Using a lookahead to narrow it down is usually more costly:

    (?s)<!--(?:(?!<!).)+?stylesheet.+?-->
    

    See test at regex101. At each character between <!-- and stylesheet, (?!<!). looks ahead to check that another <!... does not start there, so the match stays inside one element. This is similar to the negated hyphen solution.
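The effect of the lookahead is easiest to see next to the same pattern with a plain dot. A small sketch on the sample from the question; $plain and $tempered are names chosen for the illustration.

```php
<?php
// Sample document from the question, rebuilt line by line.
$html = implode("\n", [
    '<!-- The quick brown fox',
    '              jumps over the lazy dog -->',
    '',
    '<!--[if IE 7]>',
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />',
    '<![endif]-->',
    '',
    '<!-- Pack my box with five dozen liquor jugs -->',
]);

// Plain dot: nothing stops .+? from crossing into the next comment,
// so the match begins at the very first <!--.
preg_match('/(?s)<!--.+?stylesheet.+?-->/', $html, $plain);

// Dot guarded by (?!<!): the attempt from the first <!-- fails as soon
// as the next <! is reached, and the engine retries further right.
preg_match('/(?s)<!--(?:(?!<!).)+?stylesheet.+?-->/', $html, $tempered);

printf("plain starts with:    %s\n", strtok($plain[0], "\n"));
printf("tempered starts with: %s\n", strtok($tempered[0], "\n"));
```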


    Instead of .* I used .+ (one or more); which to use depends on what should be matched. Here + fits better.
    Which solution to use depends on the exact requirements. For this case I would use the first.