Recursive PCRE search with patterns


This question has to do with PCRE.

I have seen a recursive search for nested parentheses used with this construct:

\(((?>[^()]+)|(?R))*\)

The problem with this is that, although '[^()]+' can match any character other than the parentheses (including newline), the delimiters themselves have to be single characters, such as braces, brackets, punctuation, single letters, etc.

What I am trying to do is replace the '(' and ')' characters with ANY kind of pattern (keywords such as 'BEGIN' and 'END', for example).

I have come up with the following construct:

(?xs)  (?# <-- 'xs': ignore whitespace in the pattern, and allow '.'
               to match newline )
(?P<pattern1>BEGIN)
(
   (?> (?# <-- "once only" search )
      (
         (?! (?P=pattern1) | (?P<pattern2>END)).
      )+
   )
   | (?R)
)*
END

This will actually work on something that looks like this:

BEGIN <<date>>
  <<something>
    BEGIN
      <<something>>
    END <<comment>>
    BEGIN <<time>>
      <<more somethings>>
      BEGIN(cause we can)END
      BEGINEND
    END
  <<something else>>
END

This successfully matches any nested BEGIN..END pairs.

I set up named patterns pattern1 and pattern2 for BEGIN and END, respectively. Using pattern1 in the search term works fine. However, I can't use pattern2 at the end of the search: I have to write out 'END'.

Any idea how I can rewrite this regex so that I only have to specify each pattern once and can reuse it everywhere in the expression? In other words, so that I don't have to write END both in the middle of the search and at the very end.


Solution

  • To extend @Kobi's answer, consider the following regex:

    (?xs)
    (?(DEFINE)
            (?<pattern1>BEGIN)
            (?<pattern2>END)
    )
    (?=((?&pattern1)
    (?:
       (?> (?# <-- "once only" search )
          (?:
             (?! (?&pattern1) | (?&pattern2)) .
          )+
       )*
       | (?3)
    )*
    (?&pattern2)
    ))
    

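    Unlike a backreference such as (?P=pattern1), which only repeats the text a group has already matched, a subroutine call such as (?&pattern2) re-runs the group's pattern, so each delimiter is written only once. Because the whole block pattern sits inside a lookahead, the regex also matches at every nested BEGIN, which lets you fetch the text of each individual block. Read it from capture group 3, since groups 1 and 2 are taken by the two patterns in the DEFINE block.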

    Demo: http://regex101.com/r/bX8mB6
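
For readers whose engine has no recursion or subroutine calls (Python's built-in re module, for example), the same per-block extraction can be approximated outside the regex. The sketch below is not the pattern above: it swaps in a plain stack-based scan, and the function name nested_blocks and its parameters are made up for illustration. Each delimiter is still written only once.

```python
import re


def nested_blocks(text, open_pat=r"BEGIN", close_pat=r"END"):
    """Return every balanced open..close block in text, nested ones included.

    Hypothetical helper: a stack-based stand-in for the recursive pattern,
    not a PCRE feature. Blocks are listed in order of where they start.
    """
    # Each delimiter pattern appears once; the named groups record which
    # of the two matched at a given position.
    token = re.compile(f"(?P<open>{open_pat})|(?P<close>{close_pat})")
    stack = []   # start offsets of openers still waiting for a closer
    blocks = []  # (start, end) offsets of completed blocks
    for m in token.finditer(text):
        if m.group("open") is not None:
            stack.append(m.start())
        elif stack:
            # A closer pairs with the most recent pending opener.
            blocks.append((stack.pop(), m.end()))
        # A closer with no pending opener is ignored.
    blocks.sort()
    return [text[start:end] for start, end in blocks]


if __name__ == "__main__":
    sample = "BEGIN a BEGIN b END c BEGINEND END"
    for block in nested_blocks(sample):
        print(repr(block))
```

On the input "BEGIN a BEGIN b END c BEGINEND END" this returns the whole text first, followed by "BEGIN b END" and "BEGINEND". Unmatched BEGIN or END tokens are simply skipped.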