python regex email-parsing

Using regular expressions to parse email data


I want to parse my email inbox, find marketing emails that contain coupon codes, and extract the codes from them. The logic I have written works on only one type of data.

import re

def extract_promo_code(body):
    # Use regular expressions to find promo code
    promo_code_pattern = r'(?i)(?:Enter\s+Code|Enter\s+promo)(?:[\s\n]*)([A-Z0-9]+)'
    match = re.search(promo_code_pattern, body)
    if match:
        promo_code = match.group(1)
        # Remove any non-alphanumeric characters from the promo code
        promo_code = re.sub(r'[^A-Z0-9]', '', promo_code)
        return promo_code
    else:
        return None

Following are a couple of samples from which I want to extract coupon code:

  1. "Enter code at checkout.* Offer valid until October 6, 2023, 11:59pm CT MKEA15EMYZGP8W"

  2. "Enter code JSB20GR335F4 Ends September 21, 2023, at 11:59pm CT.*"

I want the code to catch the first promo code that comes after the text "Enter Code" or "enter promo". The promo code consists of a mix of digits and uppercase letters, and it should be found even if there are line breaks and spaces between the text and the promo code.

The above code runs fine for sample 2 but doesn't catch the code in sample 1.


Solution

  • You can use the following pattern (adjust it as needed; I assumed that a promo code has at least 10 characters) (regex101 demo):

    import re
    
    text = """\
    Enter code at checkout.* 
    Offer valid until October 6, 2023, 11:59pm CT MKEA15EMYZGP8W
    
    Enter code JSB20GR335F4 Ends September 21, 2023, at 11:59pm CT.*
    """
    
    pat = r"""(?s)Enter (?:code|promo).*?\b([A-Z\d]{10,})"""
    
    for code in re.findall(pat, text):
        print(code)
    

    Prints:

    MKEA15EMYZGP8W
    JSB20GR335F4
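
    For reference, the pattern in the question fails on sample 1 because the global (?i) flag also makes [A-Z0-9]+ match lowercase letters, so it captures "at" right after "Enter code". Below is a minimal sketch of how the pattern above could be folded back into the question's extract_promo_code function so that it returns only the first code. The min_len parameter and the trailing \b are my own additions for illustration, and the 10-character minimum remains an assumption about what a promo code looks like:

```python
import re


def extract_promo_code(body, min_len=10):
    """Return the first promo code after 'Enter code' / 'Enter promo', or None.

    Assumption: a promo code is a run of at least `min_len` uppercase
    letters and digits (min_len is a hypothetical tuning knob).
    """
    # (?s)      lets .*? cross line breaks
    # (?i:...)  makes only the keyword case-insensitive, so the code
    #           itself must still be uppercase letters and digits
    pattern = r"(?s)(?i:Enter\s+(?:code|promo)).*?\b([A-Z\d]{%d,})\b" % min_len
    match = re.search(pattern, body)
    return match.group(1) if match else None


sample_1 = "Enter code at checkout.* Offer valid until October 6, 2023, 11:59pm CT MKEA15EMYZGP8W"
sample_2 = "Enter code JSB20GR335F4 Ends September 21, 2023, at 11:59pm CT.*"

print(extract_promo_code(sample_1))
print(extract_promo_code(sample_2))
```

    Using re.search instead of re.findall stops at the first match, which is what the question asks for. If your codes can be shorter than 10 characters, lower min_len, but note that short uppercase words or numbers (such as "CT" or "2023") may then be picked up by mistake.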