python apache hadoop hadoop-streaming log-files

How to process an Apache log file with Hadoop using Python


I am very new to Hadoop and don't understand the concepts well yet. I have followed the process below:

  1. Installed Hadoop by following the guide here

  2. Tried the basic examples from the tutorial here, and the wordcount example in Python; both are working fine.

What I am actually trying to do (the requirement I was given) is to process the Apache log files on Fedora (Linux), located at /var/log/httpd, with Hadoop using Python, and produce output in the format below:

IP address    Count of IP   Pages accessed by IP address

I know that Apache produces two kinds of log files:

  1. access_logs

  2. error_logs

but I really don't understand the format of the Apache log files.

My Apache log file content looks like this:

::1 - - [29/Oct/2012:15:20:15 +0530] "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
::1 - - [29/Oct/2012:15:20:16 +0530] "GET /phpMyAdmin/js/cross_framing_protection.js?ts=1336063073 HTTP/1.1" 200 331 "http://localhost/phpMyAdmin/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
::1 - - [29/Oct/2012:15:20:16 +0530] "GET /phpMyAdmin/js/jquery/jquery-1.6.2.js?ts=1336063073 HTTP/1.1" 200 92285 "http://localhost/phpMyAdmin/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"

Can anyone please explain the structure of the Apache log lines above?

I am confused about how to process the log file to get the IP address, the count per IP address, and the pages accessed by each IP address.

Can anyone tell me how to process the Apache log files with Hadoop using Python and the information above, and store the result in the format mentioned?

Also, can anyone provide basic Python code for processing the Apache log files into the above format, so that I get a realistic idea of how to process the files and can extend the code to my needs?


Solution

  • This is just a partial answer, but I hope you find it useful. If you need anything more specific, please update your question with your code and the specific points where you are stuck.

    file processing stuff

    The Python docs explain file processing really well.

    If you want to monitor the log files in real time (I think that's what your question meant...), then check out this question here; it's also about monitoring a log file. I don't really like the accepted answer, but there are lots of good suggestions.
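    If simple polling is enough for your case, a `tail -f`-style reader can be written with plain file operations. This is just a sketch under my own naming (`follow`, `poll_interval` are not from any library), and you would point `path` at your log, e.g. /var/log/httpd/access_log:

    ```python
    import time

    def follow(path, poll_interval=0.1):
        """Yield each line of the file, then keep waiting for new
        lines to be appended (roughly what `tail -f` does)."""
        with open(path) as f:
            while True:
                line = f.readline()
                if line:
                    yield line
                else:
                    # no complete new line yet; wait and try again
                    time.sleep(poll_interval)
    ```

    Note this simple version doesn't handle log rotation; when Apache rotates access_log you would need to detect that and reopen the file.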

    line processing stuff

    Once you manage to get individual lines out of the log file, you'll want to process them. They are just strings, so as long as you know the format it's pretty simple. Again I refer you to the Python docs; if you want to do anything more intensive, you might want to check those out.

    EDIT: Given the actual format of the log lines, we can now make progress.

    Suppose you grab a line from the log file:

    line = '::1 - - [29/Oct/2012:15:20:15 +0530] "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"'
    

    The first step is to split it up into pieces. I make use of the fact that the date and time are surrounded by '[...]':

    lElements = line.split('[')
    # wrap the first piece in a list, otherwise str + list raises a TypeError
    lElements = [lElements[0]] + lElements[1].split(']')
    

    This leaves us with:

    lElements[0] = '::1 - - ' #IPv6 localhost = ::1
    lElements[1] = '29/Oct/2012:15:20:15 +0530'
    lElements[2] = ' "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"'
    

    The date element can be converted into a friendlier format.
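    For example, with `datetime.strptime` (Python 3's `%z` directive understands the `+0530` offset):

    ```python
    from datetime import datetime

    stamp = '29/Oct/2012:15:20:15 +0530'   # i.e. lElements[1] from above
    # %d/%b/%Y:%H:%M:%S matches '29/Oct/2012:15:20:15', %z matches '+0530'
    when = datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')
    print(when.isoformat())   # -> 2012-10-29T15:20:15+05:30
    ```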

    The 'url' element contains the details of the actual request: the HTTP verb, the requested path, the HTTP version, the response status code (200), the response size in bytes (6961), the referrer, and the user-agent string.

    EDIT: Adding code to grab the url and ip address, ignoring the time stuff.

    ip_address = lElements[0].split('-')[0].strip() # the dashes are the identd and auth-user fields, both empty ('-') here
    http_info = lElements[2].split('"')[1] # = 'GET /phpMyAdmin/ HTTP/1.1'
    url = http_info.split()[1]  # = '/phpMyAdmin/'
    
    """
    so now we have the ip address and the url. the next bit of code updates a dictionary dAccessCount as the number of url accesses increases...
    dAccessCount should be set to {} initially
    """
    
    if ip_address in dAccessCount:
        if url in dAccessCount[ip_address]:
            dAccessCount[ip_address][url]+=1
        else:
            dAccessCount[ip_address][url]=1
    else:
        dAccessCount[ip_address] = {url:1}
    

    So the keys of dAccessCount are all the IP addresses that have accessed any url, the keys of dAccessCount[some_ip_address] are all the urls that that IP address has accessed, and finally dAccessCount[some_ip_address][some_url] is the number of times some_url was accessed from some_ip_address.
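    Finally, to get that dictionary into the output format you asked for (IP address, count of requests, pages accessed), you can walk it like this. A sketch only: `format_report` is my own name, and the tab-separated layout is just one reasonable choice:

    ```python
    def format_report(dAccessCount):
        """One line per IP: IP address, total requests, comma-separated pages."""
        lines = []
        for ip, urls in sorted(dAccessCount.items()):
            total = sum(urls.values())       # count of requests from this IP
            pages = ','.join(sorted(urls))   # pages accessed by this IP
            lines.append('%s\t%d\t%s' % (ip, total, pages))
        return lines

    # example with the structure built above
    dAccessCount = {'::1': {'/phpMyAdmin/': 2, '/index.html': 1}}
    for row in format_report(dAccessCount):
        print(row)
    ```

    Under Hadoop streaming you would split this into two scripts: a mapper that parses each line and prints an ip/url pair to stdout, and a reducer that builds dAccessCount from the sorted pairs and prints the report.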