python file-io

Parsing Apache log files


I just started learning Python and would like to read an Apache log file and put parts of each line into different lists.

A sample line from the file:

172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"

According to the Apache website, the format is:

%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"

I'm able to open the file and read it as is, but I don't know how to parse it according to that format so that I can put each part in a list.


Solution

  • This is a job for regular expressions.

    For example:

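    import re

    line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
    # Groups, in order: client IP, timestamp, request line, status code, referer, user agent.
    # A raw string keeps the backslashes intact for the regex engine.
    regex = r'^([\d.]+) [^ ]* [^ ]* \[([^ ]* [^ ]*)\] "([^"]*)" (\d+) [^ ]* "([^"]*)" "([^"]*)"'

    print(re.match(regex, line).groups())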
    

    The output is a tuple with six pieces of information from the line (the groups captured by the parentheses in the pattern):

    ('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
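
Since the question asks for each part to end up in its own list, here is a minimal sketch of one way to extend the same idea to many lines. The group names (`host`, `ident`, `user`, and so on) and the `parse_log` helper are illustrative choices, not anything defined by Apache or by the answer above. It assumes every line follows the combined format shown in the question and that quoted fields contain no escaped double quotes.

```python
import re

# Named groups mirror the fields of the log format from the question.
# The names themselves are hypothetical labels chosen for this sketch.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log(lines):
    """Collect each field into its own list; skip lines that do not match."""
    columns = {name: [] for name in LOG_PATTERN.groupindex}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed or differently formatted line
        for name, value in match.groupdict().items():
            columns[name].append(value)
    return columns


sample = [
    '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"',
    'not a log line',
]
columns = parse_log(sample)
print(columns['host'])
print(columns['status'])
```

Because `parse_log` accepts any iterable of strings, an open file object can be passed to it directly, for example inside a `with open(...)` block, so the file is read one line at a time rather than loaded into memory all at once.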