python email mbox

Decode and access an mbox file with the Python mailbox module


I need to migrate an email database to a CRM and have two problems:

I can access the mbox file, but the content is not properly decoded.

I want to create a dataframe-like structure with the following columns: "date, from, to, subject, body"

I have tried the following:

for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        content = (part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

and get the following output:

from   : =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <gonzalo.gasset@baud.es>
subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=
content: <generator object <genexpr> at 0x7fe025f3a350>
**************************************
from   : Mailtrack Reminder <reminders@mailtrack.io>
subject: Re: Presupuesto de Logotipo y =?utf-8?Q?Dise=C3=B1o?= Corporativo
 para nuevo proyecto
content: b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width">\r\n    <title>Reminder</title>\r\n</head>\r\n<style media="screen">\r\n    body {\r\n        font-family: Helvetica;\r\n    }\r\n</style>\r\n<body style="background-color: #f6f6f6; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; .....

Solution

  • The concrete implementations of mailbox.Mailbox accept a factory argument that is used to build message objects. By passing the parse method of a BytesParser initialised with the default policy, we get EmailMessage instances, which decode headers and body text automatically.

    Selecting the actual body is trickier, and depends on your particular requirements. In the code sample below, all "text" type parts are joined together, while non-text parts are skipped. Note that this keeps both the text/plain and text/html alternatives when a message carries both, so you might wish to apply your own selection criteria.

    from email.parser import BytesParser
    from email.policy import default
    import mailbox
    
    # path_to_mailbox is the path to your mbox file
    mbox = mailbox.mbox(path_to_mailbox, factory=BytesParser(policy=default).parse)
    
    for message in mbox:
        print("date   :", message['date'])
        print("to     :", message['to'])
        print("from   :", message['from'])
        print("subject:", message['subject'])
        if message.is_multipart():
            contents = []
            for part in message.walk():
                maintype = part.get_content_maintype()
                if maintype != 'text':
                    # Skip multipart containers and non-text parts
                    continue
                contents.append(part.get_content())
            content = '\n\n'.join(contents)
        else:
            content = message.get_content()
        print("content:", content)
        print("**************************************")
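
The question also asks for a dataframe-like structure with the columns "date, from, to, subject, body". One possible way to get there, building on the factory approach above, is to collect one dict per message. This is only a sketch: the helper names extract_body and mbox_to_rows are made up for illustration, and using get_body() to prefer the plain-text part over the HTML part is an assumption about what "body" should mean for your migration.

```python
import mailbox
from email.parser import BytesParser
from email.policy import default


def extract_body(message):
    """Return the preferred text body of an EmailMessage, or '' if there is none."""
    # Assumption: plain text is preferred, with HTML as the fallback.
    part = message.get_body(preferencelist=('plain', 'html'))
    if part is None:
        return ''
    return part.get_content()


def header_text(message, name):
    """Return a decoded header as a plain str, or '' if the header is missing."""
    value = message[name]
    return '' if value is None else str(value)


def mbox_to_rows(path):
    """Return a list of dicts with the keys date, from, to, subject, body."""
    mbox = mailbox.mbox(path, factory=BytesParser(policy=default).parse)
    rows = []
    try:
        for message in mbox:
            date_header = message['date']
            rows.append({
                # With the default policy the Date header exposes a parsed
                # datetime (None if the header is missing or unparseable).
                'date': date_header.datetime if date_header is not None else None,
                'from': header_text(message, 'from'),
                'to': header_text(message, 'to'),
                'subject': header_text(message, 'subject'),
                'body': extract_body(message),
            })
    finally:
        mbox.close()
    return rows
```

Calling mbox_to_rows(path_to_mailbox) returns a plain list of dicts. If pandas is available, passing that list to pandas.DataFrame should give the tabular structure directly. Messages with unusual or mislabelled charsets may still need extra error handling around get_content().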