[SOLVED] Properly match nested brackets with regex in Python

I'm attempting to parse a Wikipedia file dump with RegEx.

I want to match and remove everything between a pair of brackets, including the brackets themselves. I also want to check whether the first word after the opening bracket is a certain word and, if it is, leave that block alone. In my case, each bracket consists of two characters: {{ opens and }} closes.

For example, take the following sequence into consideration:

{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}

Using the following regex:

{{(?!(notmeeither))(.|\n)*?\}}

matches only the first {{{{}}, which leaves unbalanced brackets behind. Making the match greedy does not help, because it then swallows the text in between, including text that is not supposed to be matched. How would I go about this?
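
The failure is easy to reproduce with the standard re module. The sketch below escapes the braces and uses non-capturing groups, which does not change what the pattern matches; the variable names are only for illustration:

```python
import re

text = "{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}"

# Lazy pattern from the question: stops at the first "}}" it can reach,
# so it has no notion of nesting depth.
lazy = re.compile(r"\{\{(?!notmeeither)(?:.|\n)*?\}\}")

first = lazy.search(text).group(0)   # the first (too short) match
result = lazy.sub("", text)          # what is left after removing all matches

print(repr(first))
print(repr(result))
```

The first match ends at the innermost closing pair, so the outer closing }} is never consumed and stays in the output.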

Solution

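With the third-party regex package (not the standard re module) you can specify recursive patterns. Here (?R) re-applies the whole pattern at that point, so every nested {{...}} pair is consumed as a unit, and the atomic group (?>...) prevents backtracking into what it has already matched: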

>>> import regex
>>> regex.sub(r"{{(?!(notmeeither))((?>[^{}]+|(?R))*)}}","","{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")
" Don't delete me {{notmeeither}}"
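
If installing the regex package is not an option, a hand-written scanner that counts nesting depth can do a similar job with the standard library only. This is a minimal sketch, not the method from the answer above: the function names strip_blocks and find_block_end and the keep parameter are made up for illustration, and it assumes that an unbalanced block should be left in the text untouched.

```python
def find_block_end(text, start):
    """Return the index just past the '}}' that closes the '{{' at start,
    or -1 if the block is never closed."""
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1


def strip_blocks(text, keep="notmeeither"):
    """Remove every top-level {{...}} block unless it starts with keep."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            end = find_block_end(text, i)
            if end == -1:
                # Assumption: leave an unbalanced tail as it is.
                out.append(text[i:])
                break
            if text.startswith(keep, i + 2):
                out.append(text[i:end])  # protected block, copy verbatim
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


cleaned = strip_blocks("{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")
print(repr(cleaned))
```

Unlike the recursive pattern, this version only checks the keep word on the outermost block and tolerates single braces inside a block, so the two approaches can differ on unusual input.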