regex recursive-regex

Regex for nested XML attributes


Let's say I have the following string:

"<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>"

How can I write a general-purpose regex (tag names and attribute names vary) that matches the content inside {}, i.e. either <dd>sop</dd> or <bb y={ <cc x={st}>ABC</cc> }></bb>?

The regex I wrote, "(\s*\w*=\s*\{)\s*(<.*>)\s*(\})", matches

"<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb>" which is not correct.


Solution

  • In generic regex there's no good way to handle arbitrary nesting. Hence all the whining whenever a question like this comes up: never use regex to parse XML/HTML.

    In some simple cases it might be advantageous though. If, like in your example, there's a limited number of levels of nesting, you can quite simply add one regex for each level.

    Now let's do this in steps. To handle the first un-nested attribute you can use

    {[^}]*}
    

    This matches an opening brace, followed by any number of characters that are anything but a closing brace, followed by a closing brace. For simplicity I'm gonna put the heart of it in a non-capturing group, like

    {(?:[^}])*}
    

    The group is needed in the next step, when alternatives get inserted into it.
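    To see what the un-nested pattern does on the string from the question, here is a minimal Python sketch. The variable names are made up for illustration, and the braces are escaped, which is harmless and keeps the pattern portable to engines that are stricter about a bare {.

```python
import re

# Sample string from the question.
text = '<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>'

# Un-nested pattern: opening brace, anything but a closing brace, closing brace.
flat = re.compile(r'\{[^}]*\}')
# Same pattern with its heart wrapped in a non-capturing group.
flat_grouped = re.compile(r'\{(?:[^}])*\}')

matches = flat.findall(text)
for m in matches:
    print(m)
# {<dd>sop</dd>}
# { <bb y={ <cc x={st}
```

    The first attribute is matched in full, but the second match stops at the first closing brace it meets, which is why nesting needs extra handling.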

    If you now allow that "anything but a closing brace" ([^}]) to also be another nested pair of braces, simply by joining the first regex in as an alternative, like

    {(?:{[^}]*}|[^}])*}
        ^^^^^^^    original regex inserted as alternative (to itself)
    

    it allows for one level of nesting. Doing the same again, joining this regex as an alternative to itself, like

    {(?:{(?:{[^}]*}|[^}])*}|{[^}]*}|[^}])*}
        ^^^^^^^^^^^^^^^^^^^    previous level repeated
    

    will allow for another level of nesting. This can be repeated for more levels if wanted.
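    The "join the regex as an alternative to itself" step can also be automated. The following is only a sketch: nested_brace_pattern is a hypothetical helper name, and it is a slight simplification of the hand-written pattern, because each round wraps the previous level as the only brace alternative and leaves out the extra {[^}]*} in the middle.

```python
import re

def nested_brace_pattern(levels: int) -> str:
    """Build a regex for {...} that tolerates `levels` levels of nested braces."""
    # Level 0: the plain, un-nested pattern.
    pattern = r'\{[^}]*\}'
    for _ in range(levels):
        # Insert the previous level as an alternative to "anything but }".
        pattern = r'\{(?:' + pattern + r'|[^}])*\}'
    return pattern

# Sample string from the question.
text = '<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>'

matches = re.findall(nested_brace_pattern(2), text)
for m in matches:
    print(m)
# {<dd>sop</dd>}
# { <bb y={ <cc x={st}>ABC</cc> }></bb> }
```

    With levels=1 the second match would end one closing brace too early, since the string nests two levels deep inside z={...}.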

    This doesn't handle the capture of attribute names and stuff though, because your question isn't quite clear on what you want there, but it shows you one way (i.m.o. the easiest to understand, or... :P) to handle nesting in regex.
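    If the goal is to get each top-level attribute name together with its braced content, one possible extension is to put a name part in front of the nested pattern. This is a guess at what is wanted, not something the question spells out: the (\w+)\s*=\s* prefix, the two-level depth and the names BRACES and ATTR are all assumptions made for this sketch.

```python
import re

# Brace pattern allowing two levels of nesting (braces escaped for portability).
BRACES = r'\{(?:\{(?:\{[^}]*\}|[^}])*\}|[^}])*\}'
# Assumed attribute shape: name, optional spaces, '=', optional spaces, {...}.
ATTR = re.compile(r'(\w+)\s*=\s*(' + BRACES + r')')

# Sample string from the question.
text = '<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>'

# Drop the outer braces and the surrounding whitespace from each value.
pairs = [(m.group(1), m.group(2)[1:-1].strip()) for m in ATTR.finditer(text)]
for name, value in pairs:
    print(name, '->', value)
# v -> <dd>sop</dd>
# z -> <bb y={ <cc x={st}>ABC</cc> }></bb>
```

    The nested attributes y and x are swallowed by the match for z, so only the outermost attributes are reported. Running the same pattern again on a captured value would reach the next level down.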

    You can see it handle your example here at regex101.

    Regards