python python-2.7 parsing logparser

Keep getting `NoneType` errors when parsing logs with regex


Below are two examples of what the logs look like. From each line I'm trying to extract the IP address, the date/time, the HTTP method, the request part (`/071300/242153 HTTP/1.1`), the response code (just the 404/200 part), and the rest in one group:

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

and

71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

My function looks like:

def parse_logs(logs):
  log_list = []
  for log in logs:
    p = re.compile(r'''(?P<ip_addr>\d+(\.\d+){3}) - - \[(?P<date_time>.+?)\] (?P<http_method>\".+?\") (?P<return_code>\d+) \d+ "-" (?P<client>\".+?\")''')

    m = p.search(log)

    log_list.append([m.group('ip_addr'), m.group('date_time'), m.group('http_method'), m.group('return_code'), m.group('client')])

rdd_prepped = parse_logs(rdd.take(5))

When I pass a list of these logs to the function and run it, I keep getting the error: AttributeError: 'NoneType' object has no attribute 'collect'.

When I put a print(m.group('ip')) line under m = p.search(log), I get the error:

AttributeError: 'NoneType' object has no attribute 'group'

Why do I keep getting `NoneType` errors? I'm using Python 2.7, by the way.


Solution

  • When this was first posted, the regex looked like this:

    p = re.compile(r'''(?P<ip>\d+(\.\d+){3}) - - \[(?P<date_time>.+?)\] (?P<method>\".+?\") \
        (?P<response_code>\d+) \d+ "-" (?P<client>\".+?\")''')
    

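    Note the line-continuation character (a '\') at the end of the first line. Because the pattern is a raw, triple-quoted string, that backslash does not join the two lines: the pattern literally contains the backslash, the newline, and the indentation that follows. The regex therefore demands a newline and a run of spaces between the method and the response code. No log line contains that, so p.search(log) returns None.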
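
    A minimal sketch of the effect, using a shortened version of the pattern. The names `broken`, `fixed` and `line` are made up for this illustration:

```python
import re

# Raw triple-quoted string: the trailing backslash is NOT a line continuation,
# so the backslash, the newline and the indentation all stay in the pattern.
broken = r'''(?P<method>\".+?\") \
        (?P<response_code>\d+)'''

# Same pattern written on one line.
fixed = r'''(?P<method>\".+?\") (?P<response_code>\d+)'''

line = '"GET /error HTTP/1.1" 404 505'

print(repr(broken))                                    # the embedded newline and spaces are visible
print(re.search(broken, line))                         # None
print(re.search(fixed, line).group('response_code'))   # 404
```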

    Rewrite the pattern on a single line and it should work:

    p = re.compile(r'''(?P<ip>\d+(\.\d+){3}) - - \[(?P<date_time>.+?)\] (?P<method>\".+?\") (?P<response_code>\d+) \d+ "-" (?P<client>\".+?\")''')
    

    For complicated regular expressions, I like to use verbose mode:

    regex = re.compile(r"""
        (?P<ip>\d+(?:\.\d+){3})     # four dot-separated sets of digits
        .*?                         # skip ahead
        \[(?P<date_time>.*?)\]      # date time is everything between '[ ]'
        .*?                         # skip
        "(?P<method>.*?)"           # method is everything between quotes
        .*?                         # skip
        (?P<response_code>\d+)      # multiple digits
        .*?                         # skip
        "-"                         # don't care
        .*?                         # skip
        "(?P<client>.*?)"           # client is everything between quotes
        """, re.VERBOSE)
    
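    As a rough check, the verbose pattern can be run against the first sample line from the question. The `sample` and `fields` names are illustrative. The final `split()` assumes the request always consists of three space-separated parts, which holds for the two sample lines but may not hold for every log line:

```python
import re

regex = re.compile(r"""
    (?P<ip>\d+(?:\.\d+){3})     # four dot-separated sets of digits
    .*?                         # skip ahead
    \[(?P<date_time>.*?)\]      # everything between '[' and ']'
    .*?
    "(?P<method>.*?)"           # first quoted field: the request
    .*?
    (?P<response_code>\d+)      # first run of digits after the request
    .*?
    "-"                         # referrer placeholder
    .*?
    "(?P<client>.*?)"           # last quoted field: the user agent
    """, re.VERBOSE)

sample = ('66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] '
          '"GET /071300/242153 HTTP/1.1" 404 514 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

m = regex.search(sample)
fields = m.groupdict()          # dict keyed by group name

# Assumption: the request is "<verb> <path> <protocol>".
verb, path, protocol = fields['method'].split()

print(fields['ip'], fields['response_code'], verb, path)
```

    Here the quotes sit outside the named groups, so `method` and `client` come back without the surrounding quote characters.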

    A few more things:

    If you expect the regex to match (almost) every line in the log, then you should print/log any lines that don't match. That helps catch errors in your regex, or when someone changes the log format without telling you.

    Move the re.compile step out of the loop.

    MatchObject.group() can take multiple arguments and returns a tuple of the listed groups.

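    import re

    def parse_logs(logs):
      log_list = []

      # Compile once, outside the loop.
      p = re.compile(...whichever regex style you like...)

      for log in logs:
        m = p.search(log)

        if m:
          # The names passed to group() must match the named groups in the regex.
          log_list.append(m.group('ip_addr', 'date_time', 'http_method',
                                  'return_code', 'client'))
        else:
          # Report lines the pattern failed to match.
          print(log)

      # Without an explicit return, the function returns None.
      return log_list

    rdd_prepped = parse_logs(rdd.take(5))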
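
    Putting the pieces together, here is a self-contained sketch that runs without Spark. It uses the single-line pattern and group names from the question. `LOG_PATTERN` and `sample_logs` are names invented for this example, and the list stands in for `rdd.take(5)`. Note the explicit `return`: a function that only appends to a local list returns None, which would be consistent with the `'NoneType' object has no attribute 'collect'` error if the result is later used like an RDD.

```python
import re

# Single-line pattern from the question, split across adjacent raw strings.
# Python joins adjacent literals, so no newline or indent ends up in the pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip_addr>\d+(\.\d+){3}) - - \[(?P<date_time>.+?)\] '
    r'(?P<http_method>".+?") (?P<return_code>\d+) \d+ "-" '
    r'(?P<client>".+?")'
)

def parse_logs(logs):
    """Return one tuple of fields per matching line; print the rest."""
    log_list = []
    for log in logs:
        m = LOG_PATTERN.search(log)
        if m:
            log_list.append(m.group('ip_addr', 'date_time', 'http_method',
                                    'return_code', 'client'))
        else:
            print('no match: %r' % log)
    return log_list

# Hypothetical stand-in for rdd.take(5): the two lines from the question
# plus one made-up malformed line.
sample_logs = [
    '66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"',
    'not a log line',
]

parsed = parse_logs(sample_logs)
print(len(parsed))
```

    This pattern hard-codes the `"-"` referrer field, so any line with a real referrer will not match and will show up in the printed "no match" output.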