python regex regex-lookarounds python-regex regex-recursion

How can I use a recursive regex or another method to recursively validate this BBcode-like markup in Python?

I am attempting to write a program that validates documents written in a markup language similar to BBcode.

This markup language has both matching ([b]bold[/b] text) and non-matching (today is [date]) tags. Unfortunately, using a different markup language is not an option.

However, my regex is not acting the way I want it to. It seems to always stop at the first matching closing tag instead of identifying that nested tag with the recursive (?R).

I am using the regex module, which supports (?R), and not re.

My questions are:

How can I effectively use a recursive regex to match nested tags without terminating on the first tag?
If there's a better method than a regular expression, what is that method?

Here is the regex once I build it: \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]

Here is a test string that doesn't work as expected: [large]test1 [large]test2[/large] test3[/large] (it should match this whole string but stops before test3)

Here is the regex on regex101.com: https://regex101.com/r/laJSLZ/1

This test doesn't need to finish in milliseconds or even seconds, but it does need to be able to validate about 100 files of 1,000 to 10,000 characters each in a time that is reasonable for a Travis-CI build.

Here is what the logic using this regex looks like, for context:

import io, regex # https://pypi.org/project/regex/

# All the tags that must have opening and closing tags
matching_tags = 'b', 'i', 'u', 'h1', 'h2', 'h3', 'large', 'small', 'list', 'table', 'grid'

# our first part matches an opening tag:
# \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\]
# our middle part matches the text in the middle, including any properly formed tag sets in between:
# (?:((?!\[\/\1\]).)*?|(?R))*
# our last part matches the closing tag for our first match:
# \[\/\1\]
pattern = r'\[(' + '|'.join(matching_tags) + r')\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]'
myRegex = re.compile(pattern)

data = ''
with open('input.txt', 'r') as file:
    data = '[br]'.join(file.readlines())

def validate(text):
    valid = True
    for node in all_nodes(text):
        valid = valid and is_valid(node)
    return valid

# (Only important thing here is that I call this on every node, this
# should work fine but the regex to get me those nodes does not.)
# markup should be valid iff opening and closing tag counts are equal
# in the whole file, in each matching top-level pair of tags, and in
# each child all the way down to the smallest unit (a string that has
# no tags at all)
def is_valid(text):
    valid = True
    for tag in matching_tags:
        valid = valid and text.count(f'[{tag}]') == text.count(f'[/{tag}]')
    return valid

# this returns each child of the text given to it
# this call:
# all_nodes('[b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b] to use [b]regex to [i]do stuff[/i][/b]')
# should return a list containing these strings:
# [b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b]
# [large]text to[/large]
# [i]with [u]regex[/u]![/i]
# [u]regex[/u]
# [b]regex to [i]do stuff[/i][/b]
# [i]do stuff[/i]
def all_nodes(text):
    matches = myRegex.findall(text)
    if len(matches) > 0:
        for m in matches:
            result += all_nodes(m)
    return result

exit(0 if validate(data) else 1)

Solution

Your main issue is within the ((?!\[\/\1\]).)*? tempered greedy token.

First, it is inefficient since you quantified it and then you quantify the whole group it is in, so making the regex engine look for more ways to match a string, and that makes it rather fragile.

Second, you only match up to the closing tag and you did not restrict the starting tag. The first step is to make the / before \1 optional, \/?. It won't stop before [tag] like tags with no attributes. To add attribute support, add an optional group after \1, (?:\s[^]]*)?. It matches an optional sequence of a whitespace and then any 0+ chars other than ].

A fixed regex will look like

\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)](?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*\[/\1]

Do not forget to compile it with regex.DOTALL to match across multiple newlines.