I'm trying to parse an Apache access log, but I'm having trouble building a pandas DataFrame from it.
The log has this format:
201.179.162.179 - - [17/Sep/2019:06:30:49 -0300] "teSubmit=Save" 400 0 "-" "-"
201.179.162.179 - - [17/Sep/2019:06:30:49 -0300] "POST /cgi-bin/ViewLog.asp HTTP/1.1" 404 0 "-" "Ankit"
80.95.44.9 - - [17/Sep/2019:06:31:55 -0300] "GET / HTTP/1.1" 200 12101 "http://netlab.ice.ufjf.br/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
50.31.26.18 - - [17/Sep/2019:06:32:14 -0300] "GET /wp-login.php HTTP/1.1" 200 1514 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
50.31.26.18 - - [17/Sep/2019:06:32:14 -0300] "POST /wp-login.php HTTP/1.1" 200 1897 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
Here's what I have so far:
file = open('access.txt')
lines = file.readlines()
logs = pd.DataFrame(columns=['ip', 'indentd', 'userid', 'time', 'request', 'status', 'size', 'Referer', 'User_agent'])
regc = re.compile('(?P<ip>.*?) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+) (?P<Referer>.*?) (?P<User_agent>.*?)')
for line in lines:
    m = regc.match(line)
    print(m)
    ip = m.group('ip')
    identd = m.group('identd')
    userid = m.group('userid')
    time = m.group('time')
    request = m.group('request')
    status = m.group('status')
    size = m.group('size')
    Referer = m.group('Referer')
    User_agent = m.group('User_agent')
    logs.append([ip, identd, userid, time, request, status, size, Referer, User_agent])
logs
All I get as output is the column names. Does logs.append() work the way I expect here?
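OK, here is a working example. DataFrame.append does not modify logs in place; it returns a new DataFrame, so calling it without keeping the result leaves logs empty. The regex also defines no identd or userid groups, so note that I commented out those lines: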
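import pandas as pd
import re

# Read all lines up front; the with block closes the file afterwards.
with open('log.txt') as file:
    lines = file.readlines()

#logs = pd.DataFrame(columns=['ip', 'time', 'request', 'status', 'size', 'Referer', 'User_agent'])
columns = ['ip', 'time', 'request', 'status', 'size', 'Referer', 'User_agent']
rows = []  # one dict per parsed line

# Raw string so that \[ and \d reach the regex engine unchanged.
# Referer and User_agent are quoted in the log, so the quotes are matched explicitly;
# a trailing lazy (.*?) with nothing after it would always match an empty string.
regc = re.compile(r'(?P<ip>.*?) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+) "(?P<Referer>.*?)" "(?P<User_agent>.*?)"')

for line in lines:
    m = regc.match(line)
    if m is None:
        continue  # skip lines that do not fit the pattern
    ip = m.group('ip')
    #identd = m.group('identd')
    #userid = m.group('userid')
    time = m.group('time')
    request = m.group('request')
    status = m.group('status')
    size = m.group('size')
    Referer = m.group('Referer')
    User_agent = m.group('User_agent')
    #logs.append([ip, time, request, status, size, Referer, User_agent], ignore_index=False)
    # DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead.
    rows.append({'ip': ip, 'time': time, 'request': request, 'status': status, 'size': size, 'Referer': Referer, 'User_agent': User_agent})

# Build the DataFrame once, after the loop.
logs = pd.DataFrame(rows, columns=columns)
print(logs.head())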
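If you also want typed columns, here is a more compact sketch of the same idea. The names LOG_PATTERN, COLUMNS and parse_log_lines are made up for this example, and the sample lines are copied from the question. It assumes every entry uses the combined log format shown there, that size is always a number (Apache can also write "-" for it), and that all timestamps share the same UTC offset.

```python
import re
import pandas as pd

# Two sample lines copied from the question, plus one line that should be skipped.
SAMPLE_LINES = [
    '201.179.162.179 - - [17/Sep/2019:06:30:49 -0300] "teSubmit=Save" 400 0 "-" "-"',
    '80.95.44.9 - - [17/Sep/2019:06:31:55 -0300] "GET / HTTP/1.1" 200 12101 '
    '"http://netlab.ice.ufjf.br/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    'this line does not look like a log entry',
]

COLUMNS = ['ip', 'identd', 'userid', 'time', 'request',
           'status', 'size', 'Referer', 'User_agent']

# Hypothetical pattern for this example: identd and userid get their own groups,
# and the quoted fields are matched together with their quotes.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identd>\S+) (?P<userid>\S+) \[(?P<time>.*?)\] '
    r'"(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+) '
    r'"(?P<Referer>.*?)" "(?P<User_agent>.*?)"'
)


def parse_log_lines(lines):
    """Return a DataFrame with one row per line that matches LOG_PATTERN."""
    # groupdict() gives a {group name: text} dict, so no per-field m.group() calls.
    rows = [m.groupdict() for m in map(LOG_PATTERN.match, lines) if m]
    logs = pd.DataFrame(rows, columns=COLUMNS)
    # Every captured value is a string; convert the numeric and time columns.
    logs['status'] = logs['status'].astype(int)
    logs['size'] = logs['size'].astype(int)
    logs['time'] = pd.to_datetime(logs['time'], format='%d/%b/%Y:%H:%M:%S %z')
    return logs


logs = parse_log_lines(SAMPLE_LINES)
print(logs[['ip', 'status', 'size', 'User_agent']])
```

For a real file you could pass the open file object instead of SAMPLE_LINES, since regc.match style matching only looks at the start of each line and ignores the trailing newline.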