I'm struggling to split some text in a piece of code that I'm writing. The program scans about 3.5 million lines of text, and the line format varies throughout the file.
I'm still working my way through everything, but the line below appears to be fairly standard within the file:
EXAMPLE_FILE_TEXT ID="20211111.111111 11111"
I want to split it as follows:
EXAMPLE_FILE_TEXT, ID, 20211111.111111 11111
As much as possible, I'd prefer to avoid hard-coding any specific text to look for, as I'm still parsing through the file and trying to determine all the different variables. I've tried running the following code:
import re
import shlex
conditioned_line = re.sub(r'(\w+=)(\w+)', r'\1"\2"', input_line)
output = shlex.split(conditioned_line)
When I run this code, I'm getting this output:
['EXAMPLE_FILE_TEXT', 'ID=20211111.111111 11111']
I've managed to split each element of this individually, but I haven't managed to split them all together in one go. I suspect this is manageable via a regular expression, or with a regular expression and a shlex split, but I could really use some suggestions if anyone has ideas.
As requested, here's another example of some text that's in the file I'm scanning:
EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""
This should separate to the following:
EXAMPLE_TEXT, TAG, AB-123-ABCD_$B, ABCDE_ABCD, ABCD_A, ABCDEF_ABCDE, ABCDEF_ABCDEF_$A, ABCDEFGH
I suggest a tokenizing approach with regex: create a regex with alternations, starting with the most specific ones, and ending with somewhat generic ones.
In your case, you may try
import re
x = 'EXAMPLE_FILE_TEXT ID="20211111.111111 11111"'
res = re.findall(r'"([^"]*)"|(\d+(?:\.\d+)*)|(\w+)', x)
print(["".join(r) for r in res])
# => ['EXAMPLE_FILE_TEXT', 'ID', '20211111.111111 11111']
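As a rough check against the second sample line from the question, the same pattern can be wrapped in a small helper. This is only a sketch: the names `TOKEN_RE`, `tokenize`, and `keep_empty` are made up for illustration, and dropping empty tokens is an assumption based on the expected output in the question, where the empty value of ABCDEFGH="" is not listed.

```python
import re

# Same alternation as above: quoted string, then dotted number, then word.
TOKEN_RE = re.compile(r'"([^"]*)"|(\d+(?:\.\d+)*)|(\w+)')

def tokenize(line, keep_empty=False):
    """Return the tokens of `line`; hypothetical helper for illustration."""
    # findall returns one tuple per match; exactly one group is non-empty,
    # except for an empty quoted string (""), where all groups are empty.
    tokens = ["".join(groups) for groups in TOKEN_RE.findall(line)]
    # Assumption: empty values such as ABCDEFGH="" are not wanted by default.
    return tokens if keep_empty else [t for t in tokens if t]

line = ('EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" '
        'ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""')
print(tokenize(line))
# => ['EXAMPLE_TEXT', 'TAG', 'AB-123-ABCD_$B', 'ABCDE_ABCD', 'ABCD_A',
#     'ABCDEF_ABCDE', 'ABCDEF_ABCDEF_$A', 'ABCDEFGH']
```

Passing `keep_empty=True` keeps the trailing empty string, which may be preferable if you need to know that a key was present with an empty value.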
The regex matches:
"([^"]*)" - a string between two double quotes: " matches a " char, then ([^"]*) captures zero or more chars other than ", and then " matches a " char (NOTE: to match a string between quotes with escaped quote support, use "([^"\\]*(?:\\.[^"\\]*)*)", and add a similar pattern for single quotes if needed)
| - or
(\d+(?:\.\d+)*) - Group 2: one or more digits and then zero or more sequences of . and one or more digits
| - or
(\w+) - Group 3: one or more word chars.
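If the file turns out to contain backslash-escaped quotes inside quoted values, the escaped-quote variant from the note above can be dropped into the same alternation. A minimal sketch, assuming a made-up sample line (`NAME`, `COUNT`, and the name `ESCAPED_TOKEN_RE` are not from the question):

```python
import re

# First alternative replaced with the escaped-quote-aware pattern;
# the other two alternatives are unchanged.
ESCAPED_TOKEN_RE = re.compile(
    r'"([^"\\]*(?:\\.[^"\\]*)*)"|(\d+(?:\.\d+)*)|(\w+)'
)

# Hypothetical input with escaped quotes inside the quoted value.
sample = r'NAME="say \"hi\" now" COUNT=42'

tokens = ["".join(groups) for groups in ESCAPED_TOKEN_RE.findall(sample)]
print(tokens)
# => ['NAME', 'say \\"hi\\" now', 'COUNT', '42']
```

Note that the captured value still contains the backslashes; if you want them removed, you would need an extra unescaping step afterwards.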