regexreluctant-quantifiers

Writing better regex expression for not using lazy repeat quantifier


I have a regular expression:

(<select([^>]*>))(.*?)(</select\s*>)

Since it uses lazy repeat quantifier, for longer strings(having options more than 500) it backtracks for more than 100,000 times and fails. Please help me to find a better regular expression which doesn't use lazy repeat quantifier


Solution

  • <select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>
    

    ...or in human-readable form:

    <select[^>]*>    # start tag
    [^<]*            # anything except opening bracket
    (?:              # if you find an open bracket
      <(?!/select>)  #   match it if it's not part of end tag
      [^<]*          #   consume any more non-brackets
    )*               # repeat as needed
    </select>        # end tag
    

    This is an example of the "unrolled loop" technique Friedl develops in his book, Mastering Regular Expressions. I did a quick test in RegexBuddy using a pattern based on reluctant quantifiers:

    (?s)<select[^>]*>.*?</select>
    

    ...and it took about 6,000 steps to find a match. The unrolled-loop pattern took only 500 steps. And when I removed the closing bracket from the end tag (</select), making a match impossible, it required only 800 steps to report failure.

    If your regex flavor supports possessive quantifiers, go ahead and use them, too:

    <select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>
    

    It takes about the same number of steps to achieve a match, but it can use a lot less memory in the process. And if no match is possible, it fails even more quickly; in my tests it took about 500 steps, the same number it took to find a match.