python apache parsing logging shlex

How to speed up this Apache log parsing?


I'm parsing big Apache logs like:

example.com:80 1.2.3.4 - - [01/Jul/2021:06:12:12 +0000] "GET /test/example/index.php?a=b&c=d HTTP/1.1" 302 486 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.3945.117 Safari/537.36"

with:

import apache_log_parser, shlex
parser = apache_log_parser.make_parser("%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"")
with open("access.log") as f:
    for l in f:  # iterate over the lines themselves; enumerate() would yield (index, line) tuples
        x = parser(l)

Solution

  • I finally found a solution that gives a 10x speedup: a pure regex.

    import re
    # One named group per field of the log format used in the question
    r = re.compile(r'(?P<server>.*?):(?P<port>.*?) (?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) \[(?P<date>.*?)\] \"(?P<request>.*?)\" (?P<status>.*?) (?P<length>.*?) \"(?P<referer>.*?)\" \"(?P<useragent>.*?)\"')

    with open("access.log") as f:
        for l in f:  # iterate over the lines directly, not over enumerate(f)
            m = r.match(l)
            if m is None:  # skip lines that do not fit the expected format
                continue
            d = m.groupdict()
            d['url'] = d['request'].split()[1] if ' ' in d['request'] else '-'
            # d['date'] = datetime.datetime.strptime(d['date'], '%d/%b/%Y:%H:%M:%S %z').isoformat()  # optional, needs "import datetime"
    

    This takes about 0.01 ms per line on my i5 laptop.
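
To check the regex without a real log file, here is a minimal sketch that applies the same pattern to the sample line from the question. The names `LOG_RE`, `parse_line`, `sample` and `record` are made up for this illustration, and the date conversion assumes an English locale so that `%b` recognises `Jul`.

```python
import re
from datetime import datetime

# Same pattern as above, split across lines for readability.
LOG_RE = re.compile(
    r'(?P<server>.*?):(?P<port>.*?) (?P<ip>.*?) (?P<remote_log_name>.*?) '
    r'(?P<userid>.*?) \[(?P<date>.*?)\] "(?P<request>.*?)" '
    r'(?P<status>.*?) (?P<length>.*?) "(?P<referer>.*?)" "(?P<useragent>.*?)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    # The request looks like "GET /path HTTP/1.1"; the URL is the second token.
    parts = d['request'].split()
    d['url'] = parts[1] if len(parts) > 1 else '-'
    # Optional: convert the Apache timestamp to ISO 8601.
    d['date'] = datetime.strptime(d['date'], '%d/%b/%Y:%H:%M:%S %z').isoformat()
    return d

sample = (
    'example.com:80 1.2.3.4 - - [01/Jul/2021:06:12:12 +0000] '
    '"GET /test/example/index.php?a=b&c=d HTTP/1.1" 302 486 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/89.0.3945.117 Safari/537.36"'
)

record = parse_line(sample)
print(record['ip'], record['status'], record['url'], record['date'])
```

All values stay strings (for example `record['status']` is `'302'`), so convert with `int()` where numbers are needed. Note that the date conversion adds noticeable per-line cost, which is why it is marked optional above.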