python regex wikitext

Properly match nested brackets with regex in Python


I'm attempting to parse a Wikipedia file dump with RegEx.

I want to match and remove everything between a pair of brackets, including the brackets themselves. I also want to check whether the first word after the opening bracket is a certain word and, if it is, leave that block untouched. In my case, a single bracket consists of two characters: {{ opens a block and }} closes it.

For example, take the following sequence into consideration:

{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}

Using the following regex:

{{(?!(notmeeither))(.|\n)*?\}}

results in matching only the first {{{{}}, which leaves stray brackets behind. Making the match greedy does not help, because it then also swallows the text in between, including the text that is not supposed to be matched. How would I go about this?
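For reference, the behaviour described above can be reproduced with the standard re module. This is only a minimal sketch; the variable names (text, pattern, first, leftover) are illustrative and not part of any real code base:

```python
import re

text = "{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}"
pattern = re.compile(r"{{(?!(notmeeither))(.|\n)*?\}}")

# The lazy quantifier stops at the first "}}" it can reach, so the first
# match ends in the middle of the nested structure.
first = pattern.search(text).group(0)
print(first)      # {{{{}}

# Removing every match leaves the outermost closing pair behind.
leftover = pattern.sub("", text)
print(leftover)   # }} Don't delete me {{notmeeither}}
```

The pattern has no notion of nesting depth, so it cannot tell which closing }} belongs to which opening {{.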


Solution

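  • The standard re module has no support for recursion, but with the third-party regex package (pip install regex) you can specify recursive patterns. Here (?R) re-enters the whole pattern, so each nested {{...}} pair is matched as a unit, while [^{}]+ consumes the text between pairs. Because the whole pattern is re-entered, the negative lookahead is also checked at every nesting level: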

    >>> import regex
    >>> regex.sub(r"{{(?!(notmeeither))((?>[^{}]+|(?R))*)}}","","{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")
    " Don't delete me {{notmeeither}}"
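If installing a third-party package is not an option, the same idea can be sketched in plain Python with a depth counter instead of a recursive pattern. This is not the approach from the answer above, only a possible alternative under a few assumptions: strip_templates and its keep parameter are hypothetical names, the keep check is a simple prefix test applied to the outermost block only (the recursive regex checks every nesting level), and an unterminated block is left in place unchanged.

```python
def strip_templates(text, keep=("notmeeither",)):
    """Remove balanced {{...}} blocks unless the outermost one starts with a prefix in `keep`."""
    keep = tuple(keep)
    out = []           # pieces of text that survive
    depth = 0          # current nesting depth
    dropping = False   # True while inside an outermost block being removed
    start = 0          # index where the current outermost block opened
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            if depth == 0:
                start = i
                # Decide once, at the outermost level, whether to drop the block.
                dropping = not text.startswith(keep, i + 2)
            depth += 1
            if not dropping:
                out.append("{{")
            i += 2
        elif text.startswith("}}", i) and depth > 0:
            depth -= 1
            if not dropping:
                out.append("}}")
            if depth == 0:
                dropping = False
            i += 2
        else:
            # Ordinary character (single braces and stray "}}" count as text).
            if not dropping:
                out.append(text[i])
            i += 1
    if depth > 0 and dropping:
        # Unbalanced input: put the unterminated block back untouched.
        out.append(text[start:])
    return "".join(out)


print(repr(strip_templates("{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")))
# " Don't delete me {{notmeeither}}"
```

A single pass like this avoids backtracking altogether, which may matter on a full dump, but it is only a sketch and does not handle other wikitext constructs.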