python, pandas, apache

How to get a pandas dataframe from an apache log?


I'm trying to parse an Apache access log, but I'm having trouble building a dataframe from it.

The log has this format:

201.179.162.179 - - [17/Sep/2019:06:30:49 -0300] "teSubmit=Save" 400 0 "-" "-"
201.179.162.179 - - [17/Sep/2019:06:30:49 -0300] "POST /cgi-bin/ViewLog.asp HTTP/1.1" 404 0 "-" "Ankit"
80.95.44.9 - - [17/Sep/2019:06:31:55 -0300] "GET / HTTP/1.1" 200 12101 "http://netlab.ice.ufjf.br/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
50.31.26.18 - - [17/Sep/2019:06:32:14 -0300] "GET /wp-login.php HTTP/1.1" 200 1514 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
50.31.26.18 - - [17/Sep/2019:06:32:14 -0300] "POST /wp-login.php HTTP/1.1" 200 1897 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"

Here's what I have so far:

file = open('access.txt')
lines = file.readlines()

logs = pd.DataFrame(columns=['ip', 'indentd', 'userid', 'time', 'request', 'status', 'size', 'Referer', 'User_agent'])

regc = re.compile('(?P<ip>.*?) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+) (?P<Referer>.*?) (?P<User_agent>.*?)')

for line in lines:
    m = regc.match(line)
    print(m)
    ip = m.group('ip')
    identd = m.group('identd')
    userid = m.group('userid')
    time = m.group('time')
    request = m.group('request')
    status = m.group('status')
    size = m.group('size')
    Referer = m.group('Referer')
    User_agent = m.group('User_agent')
    logs.append([ip, identd, userid, time, request, status, size, Referer, User_agent])

logs

And all I get as output is the column names. Does this logs.append() work the way I want?


Solution

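  • OK, here is an example that works. Two things were wrong in your code: the regex defines no `identd` or `userid` groups, so `m.group('identd')` raises an IndexError, and `DataFrame.append` returns a new DataFrame rather than modifying `logs` in place, so its result has to be assigned back: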

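    import pandas as pd
    import re
    
    # Read all lines; the with-block closes the file afterwards
    with open('log.txt') as file:
        lines = file.readlines()
    
    #logs = pd.DataFrame(columns=['ip', 'time', 'request', 'status', 'size', 'Referer', 'User_agent'])
    logs = pd.DataFrame({'ip': [], 'time': [], 'request': [], 'status': [], 'size': [], 'Referer': [], 'User_agent': [] })
    
    # Raw string, so \[ and \d are passed to the regex engine unchanged.
    # Referer and User_agent are matched together with their surrounding quotes:
    # a trailing lazy (?P<User_agent>.*?) on its own would always match the empty string.
    regc = re.compile(r'(?P<ip>.*?) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+) "(?P<Referer>.*?)" "(?P<User_agent>.*?)"')
    
    for line in lines:
        m = regc.match(line)
        if m is None:
            # Skip lines that do not fit the pattern instead of crashing on m.group()
            continue
        ip = m.group('ip')
        #identd = m.group('identd')   # the regex has no such group
        #userid = m.group('userid')   # the regex has no such group
        time = m.group('time')
        request = m.group('request')
        status = m.group('status')
        size = m.group('size')
        Referer = m.group('Referer')
        User_agent = m.group('User_agent')
        #logs.append([ip,time, request, status, size, Referer, User_agent], ignore_index=False)
        # append() returns a new DataFrame, so the result is assigned back to logs.
        # Note: DataFrame.append was removed in pandas 2.0; this line needs pandas < 2.0.
        logs = logs.append({'ip':ip, 'time': time, 'request': request, 'status':status, 'size': size, 'Referer': Referer, 'User_agent':User_agent }, ignore_index=True)
    print(logs.head())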
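
Since `DataFrame.append` was removed in pandas 2.0 (and copying the whole frame on every row is slow anyway), a common alternative is to collect the parsed rows in a list and build the DataFrame once at the end. The following is only a sketch under some assumptions: the pattern `LOG_PATTERN`, the helper `parse_log_lines`, and the lower-case column names are made up for illustration, and the regex is written for the combined log format shown in the question, so it may need adjusting for other log formats.

```python
import re
import pandas as pd

# Hypothetical column names; the regex below defines one named group per column.
COLUMNS = ['ip', 'identd', 'userid', 'time', 'request',
           'status', 'size', 'referer', 'user_agent']

# Assumed pattern for the log format shown in the question.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identd>\S+) (?P<userid>\S+) '
    r'\[(?P<time>.*?)\] "(?P<request>.*?)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>.*?)" "(?P<user_agent>.*?)"'
)

def parse_log_lines(lines):
    """Parse log lines into a DataFrame; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is not None:
            rows.append(m.groupdict())
    # Build the DataFrame once instead of appending row by row.
    df = pd.DataFrame(rows, columns=COLUMNS)
    df['status'] = df['status'].astype(int)
    # A size of "-" becomes NaN.
    df['size'] = pd.to_numeric(df['size'], errors='coerce')
    # Timestamps are converted to UTC so that mixed offsets end up in one dtype.
    df['time'] = pd.to_datetime(df['time'], format='%d/%b/%Y:%H:%M:%S %z', utc=True)
    return df

# Two lines taken from the question, plus one line that does not match.
sample = [
    '201.179.162.179 - - [17/Sep/2019:06:30:49 -0300] "teSubmit=Save" 400 0 "-" "-"',
    '80.95.44.9 - - [17/Sep/2019:06:31:55 -0300] "GET / HTTP/1.1" 200 12101 '
    '"http://netlab.ice.ufjf.br/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    'this line is not a log entry',
]
df = parse_log_lines(sample)
print(df[['ip', 'time', 'status', 'size']])
```

For a real file, the same helper could be called as `parse_log_lines(open('access.txt'))`, ideally inside a `with` block so the file gets closed.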