python regex

Extract uncaptured raw text from regex


I am given a regular expression that consists of raw text and capture groups. How can I extract all the raw text snippets from it?

For example:

pattern = r"Date: (\d{4})-(\d{2})-(\d{2})"
assert extract(pattern) == ["Date: ", "-", "-", ""]

Here, the last entry in the result is an empty string, indicating that there is no raw text after the last capture group.

The solution should not extract raw text within capture groups:

pattern = r"hello (world)"
assert extract(pattern) == ["hello ", ""]

The solution should work correctly with escaped characters too, for example:

pattern = r"\(born in (.*)\)"
assert extract(pattern) == ["(born in ", ")"]

Ideally, the solution should be efficient, avoiding looping over the string in Python.


Solution

  • What you are asking for is to extract the top-level literal tokens from a parsed regex pattern.

    If you don't mind tapping into the internals of the re package (re._parser exists in Python 3.11 and later; earlier versions expose the same parser as sre_parse), you can see from the list of tokens that re._parser.parse produces for a given pattern:

    import re
    
    pattern = r"\(born in (.*)\)"
    print(*re._parser.parse(pattern).data, sep='\n')
    

    which outputs:

    (LITERAL, 40)
    (LITERAL, 98)
    (LITERAL, 111)
    (LITERAL, 114)
    (LITERAL, 110)
    (LITERAL, 32)
    (LITERAL, 105)
    (LITERAL, 110)
    (LITERAL, 32)
    (SUBPATTERN, (1, 0, 0, [(MAX_REPEAT, (0, MAXREPEAT, [(ANY, None)]))]))
    (LITERAL, 41)
    

    that all you need is to group consecutive LITERAL tokens together and join their code points for output:

    def extract(pattern):
        literal_groups = [[]]
        for op, value in re._parser.parse(pattern).data:
            if op is re._constants.LITERAL:
                literal_groups[-1].append(chr(value))
            else:
                literal_groups.append([])
        return list(map(''.join, literal_groups))
    

    so that:

    for pattern in (
        r"Date: (\d{4})-(\d{2})-(\d{2})",
        r"hello (world)",
        r"\(born in (.*)\)"
    ):
        print(extract(pattern))
    

    outputs:

    ['Date: ', '-', '-', '']
    ['hello ', '']
    ['(born in ', ')']
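To make the behaviour of this token-grouping approach more concrete, here is a small self-contained sketch that repeats extract and runs it on two extra patterns. The try/except import is an addition, not part of the answer: it assumes that interpreters older than Python 3.11 expose the same parser as sre_parse and sre_constants. The two extra patterns are made up for illustration.

```python
try:
    from re import _parser, _constants  # private modules, Python 3.11+
except ImportError:
    # Assumption: older interpreters ship the same code under these names.
    import sre_parse as _parser
    import sre_constants as _constants


def extract(pattern):
    """Group runs of top-level LITERAL tokens, as in the answer above."""
    groups = [[]]
    for op, value in _parser.parse(pattern).data:
        if op is _constants.LITERAL:
            groups[-1].append(chr(value))
        else:
            # Any other top-level token closes the current run.
            groups.append([])
    return ["".join(g) for g in groups]


print(extract(r"(\d+)(\w+)"))   # adjacent groups -> ['', '', '']
print(extract(r"id\d+: (.*)"))  # \d+ outside a group -> ['id', ': ', '']
```

Note that every non-literal top-level token starts a new entry, not only capture groups, which is why the \d+ outside a group separates 'id' from ': '.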
    

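If the pattern is expected to contain nothing but raw text and capture groups, as the question states, a stricter variant may be safer. The following is only a sketch: extract_strict is a hypothetical name, and raising ValueError on other tokens is one possible design choice, not something the answer prescribes. It relies on the SUBPATTERN token carrying the group number as its first value, and it uses the same assumed import fallback for Python versions before 3.11.

```python
try:
    from re import _parser, _constants  # private modules, Python 3.11+
except ImportError:
    # Assumption: older interpreters ship the same code under these names.
    import sre_parse as _parser
    import sre_constants as _constants


def extract_strict(pattern):
    """Split only at capture groups; reject any other non-literal token."""
    pieces = [[]]
    for op, value in _parser.parse(pattern).data:
        if op is _constants.LITERAL:
            pieces[-1].append(chr(value))
        elif op is _constants.SUBPATTERN and value[0] is not None:
            # value[0] is the group number (None for a non-capturing group).
            pieces.append([])
        else:
            raise ValueError(f"unsupported top-level token: {op!r}")
    return ["".join(p) for p in pieces]


print(extract_strict(r"hello (world)"))  # ['hello ', '']
```

With this variant, a pattern such as r"id\d+: (.*)" raises ValueError instead of silently splitting at \d+, which makes it easier to notice input that does not match the "raw text plus capture groups" assumption.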