python regex

Coursera Course - Introduction to Data Science in Python, Assignment 1


I'm taking this course on Coursera, and I'm running into some issues with the first assignment. The task is to use regular expressions to extract certain values from the given file. The function should then output a dictionary containing these values:

example_dict = {"host": "146.204.224.152",
                "user_name": "feest6811",
                "time": "21/Jun/2019:15:45:24 -0700",
                "request": "POST /incentivize HTTP/1.1"}

Below is a sample of the file. For some reason, the link doesn't work unless it's opened directly from Coursera. I apologize in advance for the bad formatting. Note that in some cases, as in the first example, there is no username and '-' is used instead.

159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149 

This is what I currently have. However, the output is None, so I guess something is wrong with my pattern.

import re
def logs():
    
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # YOUR CODE HERE
        
        pattern = """ 
        (?P<host>\w*)
        (\d+\.\d+.\d+.\d+\ )
        (?P<user_name>\w*)
        (\ -\ [a-z]+[0-9]+\ )
        (?P<time>\w*)
        (\[(.*?)\])
        (?P<request>\w*)
        (".*")
        """
        for item in re.finditer(pattern,logdata,re.VERBOSE):
       
            print(item.groupdict())

Solution
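
As a quick check of why the attempt in the question produces nothing, the sketch below runs that pattern against the two sample lines. The names `sample`, `question_pattern` and `matches` are illustrative only. Three things appear to go wrong: the named groups are all `\w*` and sit in front of the unnamed groups that actually consume the text, the IP group already consumes the space that the following ` - ` part requires again, and `[a-z]+[0-9]+` cannot match a `-` user name. Separately, `logs()` has no `return` statement, so calling it gives None regardless of what is printed.

```python
import re

# The two sample lines from the question.
sample = (
    '159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845\n'
    '136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149\n'
)

# The pattern from the question, unchanged apart from the raw-string prefix.
question_pattern = r"""
(?P<host>\w*)
(\d+\.\d+.\d+.\d+\ )
(?P<user_name>\w*)
(\ -\ [a-z]+[0-9]+\ )
(?P<time>\w*)
(\[(.*?)\])
(?P<request>\w*)
(".*")
"""

# finditer yields no matches, so a loop over it would never print anything.
matches = list(re.finditer(question_pattern, sample, re.VERBOSE))
print(len(matches))
```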

  • You can use the following expression:

    (?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . followed by 1+ digits
    \s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
    (?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
    (?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
    "(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "
    

    A Python demo:

    import re
    logdata = r"""159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
    136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149"""
    pattern = r'''
    (?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . followed by 1+ digits
    \s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
    (?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
    (?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
    "(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "
    '''
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        print(item.groupdict())
    

    Output:

    {'host': '159.253.153.40', 'user_name': '-', 'time': '21/Jun/2019:15:46:10 -0700', 'request': 'POST /e-business HTTP/1.0'}
    {'host': '136.195.158.6', 'user_name': 'feeney9464', 'time': '21/Jun/2019:15:46:11 -0700', 'request': 'HEAD /open-source/markets HTTP/2.0'}
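
To tie this back to the assignment, here is a minimal sketch of how the pattern could be wrapped in the `logs()` function from the question. It assumes the expected result is a list with one dictionary per log line, and that the file lives at `assets/logdata.txt` as in the question. `LOG_PATTERN`, `parse_logs` and the `path` parameter are hypothetical names introduced for illustration.

```python
import re

# Same pattern as in the answer above, compiled once in verbose mode.
LOG_PATTERN = re.compile(r'''
(?P<host>\d+(?:\.\d+){3})   # dotted-quad host
\s+\S+\s+                   # skip the second field
(?P<user_name>\S+)\s+\[     # user name or "-", then the opening bracket
(?P<time>[^\]\[]*)\]\s+     # text between [ and ]
"(?P<request>[^"]*)"        # text between double quotes
''', re.VERBOSE)


def parse_logs(logdata):
    """Return one dict per matching log entry (hypothetical helper)."""
    return [match.groupdict() for match in LOG_PATTERN.finditer(logdata)]


def logs(path="assets/logdata.txt"):
    # Path taken from the question; adjust it if your file lives elsewhere.
    with open(path, "r") as file:
        return parse_logs(file.read())
```

Splitting the parsing from the file reading keeps `parse_logs` easy to try on a short string before running it on the full file. Unlike the version in the question, `logs()` returns its result instead of only printing it.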