I'm a long time reader, first time asker (please be gentle).
I've been doing this with a pretty messy WHILE READ in Unix Bash, but I'm learning python and would like to try to make a more effective parser routine.
So I have a bunch of log files which are space delimited mostly, but contain square braces where there may also be spaces. How to ignore content within the braces, when looking for delimiters?
(I'm assuming that RE library is necessary to do this)
i.e. sample input:
[21/Sep/2014:13:51:12 +0000] serverx 192.0.0.1 identity 200 8.8.8.8 - 500 unavailable RESULT 546 888 GET http ://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium [somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0]
Desired output:
'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; rev:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy DoesanyonerememberAOL/1.0'
If you'll notice the first and last fields (the ones that were in the square braces) still have spaces intact.
Bonus points The 14th field (URL) is always in one of these formats:
htp://google.com/path-data-might-be-here-and-can-contain-special-characters
google.com/path-data-might-be-here-and-can-contain-special-characters
xyz.abc.www.google.com/path-data-might-be-here-and-can-contain-special-characters
google.com:443
I'd like to add an additional column to the data which includes just the domain (i.e. xyz.abc.www.google.com or google.com).
Until now, I've been taking the parsed output using a Unix AWK with an IF statement to split this field by '/' and check to see if the third field is blank. If it is, then return first field (up until the : if it is present), otherwise return third field). If there is a better way to do this--preferably in the same routine as above, I'd love to hear it--so my final output could be:
'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; rev:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy DoesanyonerememberAOL/1.0'; **'www.google.com'**
Footnote: I changed http to htp in the sample, so it wouldn't create a bunch of distracting links.
The regular expression pattern \[[^\]]*\]|\S+
will tokenize your data, though it doesn't strip off the brackets from the multi-word values. You'll need to do that in a separate step:
import re
def parse_line(line):
values = re.findall(r'\[[^\]]*\]|\S+', line)
values = [v.strip("[]") for v in values]
return values
Here's a more verbose version of the regular expression pattern:
pattern = r"""(?x) # turn on verbose mode (ignores whitespace and comments)
\[ # match a literal open bracket '['
[^\]]* # match zero or more characters, as long as they are not ']'
\] # match a literal close bracket ']'
| # alternation, match either the section above or the section below
\S+ # match one or more non-space characters
"""
values = re.findall(pattern, line) # findall returns a list with all matches it finds