python regex split shlex

Python regular expression for multiple split criteria


I'm struggling to split some text in a piece of code that I'm writing. The software scans about 3.5 million lines of text, and the line formats vary throughout the file.

I'm still working my way through everything, but the line below appears to be fairly standard within the file:

EXAMPLE_FILE_TEXT ID="20211111.111111 11111"

I want to split it as follows:

EXAMPLE_FILE_TEXT, ID, 20211111.111111 11111

As much as possible, I'd prefer to avoid hard-coding any specific text to look for, as I'm still parsing through the file and trying to determine all the different variables. I've tried running the following code:

import re
import shlex
conditioned_line = re.sub(r'(\w+=)(\w+)', r'\1"\2"', input_line)
output = shlex.split(conditioned_line)

When I run this code, I'm getting this output:

['EXAMPLE_FILE_TEXT', 'ID=20211111.111111 11111']

I've managed to split each element individually, but I haven't managed to split them all in one pass. I suspect this is manageable with a regular expression, or with a regular expression combined with a shlex split, but I could really use some suggestions if anyone has ideas.

As requested, here's another example of some text that's in the file I'm scanning:

EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""

This should separate to the following:

EXAMPLE_TEXT, TAG, AB-123-ABCD_$B, ABCDE_ABCD, ABCD_A, ABCDEF_ABCDE, ABCDEF_ABCDEF_$A, ABCDEFGH

Solution

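  • I suggest a tokenizing approach with regex: create a regex with alternations, starting with the most specific alternatives and ending with the more generic ones, because the regex engine tries the alternatives left to right and takes the first one that matches.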

    In your case, you may try

    import re
    x = 'EXAMPLE_FILE_TEXT ID="20211111.111111 11111"'
    # Each match fills at most one capture group, so joining the groups yields the token
    res = re.findall(r'"([^"]*)"|(\d+(?:\.\d+)*)|(\w+)', x)
    print(["".join(r) for r in res])
    # => ['EXAMPLE_FILE_TEXT', 'ID', '20211111.111111 11111']
    

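As a minimal sketch, the same regex can be wrapped in a small helper and run on the second sample line from the question. The names `TOKEN_RE`, `tokenize`, and `keep_empty` are illustrative and not part of the original answer. Dropping empty tokens is an assumption made here so that the result matches the expected list in the question, where `ABCDEFGH=""` contributes only its key.

```python
import re

# Alternatives ordered from most specific to most generic:
# a double-quoted value, a dotted number, then any run of word characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\d+(?:\.\d+)*)|(\w+)')


def tokenize(line, keep_empty=False):
    """Return the tokens of `line`; quoted values come back without their quotes."""
    # findall returns one tuple of groups per match; at most one group is non-empty
    # (none are, when the match was an empty quoted string "").
    tokens = ["".join(groups) for groups in TOKEN_RE.findall(line)]
    # Assumption: empty quoted values are dropped unless the caller asks for them.
    return tokens if keep_empty else [t for t in tokens if t]


line = ('EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" '
        'ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""')
print(tokenize(line))
# => ['EXAMPLE_TEXT', 'TAG', 'AB-123-ABCD_$B', 'ABCDE_ABCD', 'ABCD_A',
#     'ABCDEF_ABCDE', 'ABCDEF_ABCDEF_$A', 'ABCDEFGH']
```

Calling `tokenize(line, keep_empty=True)` keeps a trailing `''` for the empty quoted value, which may be preferable if the tokens are later paired up as keys and values.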

    The regex matches