I'm attempting to parse a Wikipedia file dump with RegEx.
I want to match and remove everything between a set of brackets, including the brackets themselves. I also want to check whether the first word after the opening bracket is a certain word, and leave the block alone if it is. In my case, a single bracket consists of two characters, say {{ and }}.
For example, take the following sequence into consideration:
{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}
Using the following regex:
{{(?!(notmeeither))(.|\n)*?\}}
results in matching only the first {{{{}}, which leaves stray brackets behind. Making the match greedy does not help, because it then also swallows the text in between that is not supposed to be matched. How would I go about this?
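For reference, the behaviour described above can be reproduced with the standard-library re module. This is only a minimal sketch of the problem, not a fix; the variable names are illustrative, and the braces are escaped for readability (the pattern is otherwise the same).

```python
import re

text = "{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}"

# Same pattern as in the question, with the braces escaped for clarity.
pattern = r"\{\{(?!(notmeeither))(.|\n)*?\}\}"

# The lazy quantifier stops at the first "}}" it can reach,
# so the nesting of the brackets is ignored.
first_match = re.search(pattern, text).group(0)
leftover = re.sub(pattern, "", text)

print(repr(first_match))  # '{{{{}}'
print(repr(leftover))     # "}} Don't delete me {{notmeeither}}"
```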
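With the third-party regex package (not the standard-library re module) you can specify recursive patterns; (?R) recurses into the whole pattern, so nested brackets are matched as a unit: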
>>> import regex
>>> regex.sub(r"{{(?!(notmeeither))((?>[^{}]+|(?R))*)}}","","{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")
" Don't delete me {{notmeeither}}"
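If installing a third-party package is not an option, nesting can also be tracked by hand with a depth counter. The following is a rough sketch using only the standard library, not an equivalent of the recursive pattern: the function name strip_blocks and its keep parameter are made up for illustration, it checks the protected word only on the outermost block, and it assumes brackets always come in pairs of two characters (it does not handle MediaWiki's triple-brace parameters).

```python
def strip_blocks(text, keep=("notmeeither",), open_tok="{{", close_tok="}}"):
    """Remove balanced {{...}} blocks, except outermost blocks
    whose content starts with one of the words in `keep`."""
    out = []    # pieces of text that survive
    depth = 0   # current nesting level
    start = 0   # index where the current outermost block began
    i = 0
    while i < len(text):
        if text.startswith(open_tok, i):
            if depth == 0:
                start = i
            depth += 1
            i += len(open_tok)
        elif depth > 0 and text.startswith(close_tok, i):
            depth -= 1
            i += len(close_tok)
            if depth == 0:
                block = text[start:i]
                # Prefix check only, like the lookahead in the regex.
                if block[len(open_tok):].startswith(tuple(keep)):
                    out.append(block)
        else:
            # Text outside any block is kept; text inside is skipped.
            if depth == 0:
                out.append(text[i])
            i += 1
    if depth > 0:
        # Unbalanced input: keep the unterminated tail instead of dropping it.
        out.append(text[start:])
    return "".join(out)


result = strip_blocks("{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")
print(repr(result))  # " Don't delete me {{notmeeither}}"
```

Unlike the regex, this makes one pass over the text with no backtracking, which may matter on a full dump; stray closing brackets outside any block are left untouched.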