python-3.xregexregex-recursion

Regex for getting multiple words after a delimiter


I have been trying to get the separate groups from the below string using regex in PCRE:

drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)

Groups I am trying to make is like:

1. drop = blah blah blah something
2. keep = bar foo nlah aaaa
3. rename = (a=b d=e)
4. obs=4
5. where = (foo > 45 and bar == 35)

I have written a regex using recursion but for some reason recursion is partially working for selecting multiple words after drop like it's selecting just first 3 words (blah blah blah) and not the 4th one. I have looked through various stackoverflow questions and have tried using positive lookahead also but this is the closest I could get to and now I am stuck because I am unable to understand what I am doing wrong.

The regex that I have written: (?i)(drop|keep|where|rename|obs)\s*=\s*((\w+|\d+)(\s+\w+)(?4)|(\((.*?)\)))

Same can be seen here: RegEx Demo.

Any help on this or understanding what I am doing wrong is appreciated.


Solution

  • You can use a branch reset group solution:

    (?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))
    

    See the PCRE regex demo

    Details

    See Python demo:

    import regex
    pattern = r"(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))"
    text = "drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)"
    print( [x.group() for x in regex.finditer(pattern, text)] )
    # => ['drop = blah blah blah something', 'keep = bar foo nlah aaaa', 'rename = (a=b d=e)', 'obs=4', 'where = (foo > 45 and bar == 35)']
    print( regex.findall(pattern, text) )
    # => [('drop', 'blah blah blah something'), ('keep', 'bar foo nlah aaaa'), ('rename', 'a=b d=e'), ('obs', '4'), ('where', 'foo > 45 and bar == 35')]