python email mbox

Access all fields in mbox using mailbox


I am attempting to perform some processing on email messages in mbox format.

After searching and a bit of trial and error, I tried the mailbox module: https://docs.python.org/3/library/mailbox.html#mbox

I have got this to do most of what I want (although I had to write my own code to decode subjects), using the test code listed below.

I found this somewhat hit and miss. In particular, finding the key needed to look up a field such as 'subject' seems to be trial and error, and I can't find any way to list the candidate keys for a message. (I understand that the fields may differ from email to email.)

Can anyone help me to list the possible values?

I have another issue: the email may contain a number of "Received:" fields, e.g.

Received: from awcp066.server-cpanel.com
Received: from mail116-213.us2.msgfocus.com ([185.187.116.213]:60917)
    by awcp066.server-cpanel.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256)

I am interested in accessing the FIRST chronologically. I would be happy to search for it, but I can't find any way to access any but the first in the file.

#! /usr/bin/env python3
#import locale
#2020-08-31

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
import base64, quopri

def isbqencoded(s):
    """
    Test if Base64 or Quoted Printable strings
    """
    return s.upper().startswith('=?UTF-8?')

def bqdecode(s):
    """
    Convert UTF-8 Base64 or Quoted Printable string to str
    """
    nd = s.find('?=', 10)
    if s.upper().startswith('=?UTF-8?B?'):   # Base64
        bbb = base64.b64decode(s[10:nd])
    elif s.upper().startswith('=?UTF-8?Q?'): # Quoted Printable ('_' means space in headers)
        bbb = quopri.decodestring(s[10:nd], header=True)
    else:                                    # unknown encoding: return unchanged
        return s
    return bbb.decode("utf-8")

def sdecode(s):
    """
    Convert possibly multiline Base64 or Quoted Printable strings to str
    """
    outstr = ""
    if s is None:
        return outstr
    for ss in str(s).splitlines():   # split multiline strings
        sss = ss.strip()
        for sssp in sss.split(' '):   # split multiple strings
            if isbqencoded(sssp):
                outstr += bqdecode(sssp)
            else:
                outstr += sssp
            outstr+=' '
        outstr = outstr.strip()
    return outstr

INBOX = '~/temp/2020227_mbox'

print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX)
print('Values = ', mymail.values())
print('Keys = ', mymail.keys())
# print(mymail.items)
# for message in mailbox.mbox(INBOX):
for message in mymail:

#     print(message)
    subject = message['subject']
    to = message['to']
    id = message['id']
    received = message['Received']
    sender = message['from']
    ddate = message['Delivery-date']
    envelope = message['Envelope-to']


    print(sdecode(subject))
    print('To ', to)
    print('Envelope ', envelope)
    print('Received ', received)
    print('Sender ', sender)
    print('Delivery-date ', ddate)
#     print('Received ', received[1])

Based on this answer I simplified the Subject decoding, and got similar results.
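
For comparison, a subject can also be decoded on its own with the standard library's email.header module, which handles charsets other than UTF-8 as well. This is only a minimal sketch: decode_subject is a hypothetical helper name, and the sample subjects are invented for illustration.

```python
from email.header import decode_header, make_header

def decode_subject(raw):
    """Decode an RFC 2047 encoded header value to str (hypothetical helper)."""
    if raw is None:
        return ""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() joins them back into one readable string.
    return str(make_header(decode_header(str(raw))))

# Invented sample subjects: Base64 plus Quoted Printable, Latin-1, and plain.
samples = [
    "=?UTF-8?B?SGVsbG8gV8O2cmxk?= =?UTF-8?Q?_again?=",
    "=?ISO-8859-1?Q?Caf=E9?=",
    "Plain subject",
]
decoded = [decode_subject(s) for s in samples]
for subject in decoded:
    print(subject)
```

When the mailbox is opened with policy=default, as in the code below, message['subject'] is already decoded and a helper like this should not be needed.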

I am still looking for suggestions to access the remainder of the Header - in particular how to access multiple "Received:" fields.

#! /usr/bin/env python3
#import locale
#2020-09-02

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default

INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)

mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)

for message in mymail:
    print("date:  :", message['date'])
    print("to:    :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    print('Received: ', message['received'])

    print("**************************************")
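
One way to reach every "Received:" field is the get_all() method of the message object, since message['received'] only returns the first match. The sketch below is not taken from the original post: the sample message, its hostnames and its dates are invented, and hop_time is a hypothetical helper that assumes each Received value ends with "; <date>".

```python
from email import message_from_string
from email.policy import default
from email.utils import parsedate_to_datetime

# Invented sample message; hostnames and dates are placeholders.
RAW = (
    "Received: from relay.example.org by mx.example.com; "
    "Mon, 31 Aug 2020 10:00:05 +0000\n"
    "Received: from origin.example.net ([192.0.2.1])\n"
    "    by relay.example.org with esmtps; Mon, 31 Aug 2020 10:00:01 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

message = message_from_string(RAW, policy=default)

# get_all() returns every matching header, in the order they appear in the file.
received = [str(value) for value in message.get_all('Received', [])]

def hop_time(value):
    """Parse the timestamp after the last ';' of a Received value (hypothetical helper)."""
    return parsedate_to_datetime(value.rpartition(';')[2].strip())

# Relays normally prepend their header, so the last one is usually the oldest;
# taking the minimum timestamp avoids relying on that ordering.
earliest = min(received, key=hop_time)
print(len(received))
print(earliest)
```

The same get_all() call should work on messages yielded by mailbox.mbox. A Received value without a trailing date would make hop_time raise, so real mail may need a try/except around it.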

Solution

  • Based on a comment by snakecharmerb (now edited into the question), I simplified the process.
    In the end I did not need to decode the Received fields, because Message-ID already provides the id that I wanted from the original Received field.

    I list the code I finally used, in case this is of use to others. This code just extracts header fields of interest and prints them, but the full code performs analysis on the messages.

    #! /usr/bin/env python3
    #import locale
    #2020-09-05
    
    """
    Extract Message Header details from MBOX file
    """
    
    import os, time
    import mailbox
    from email.parser import BytesParser
    from email.policy import default
    
    INBOX = '~/temp/Gmail'
    
    print('Messages in ', INBOX)
    
    mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)
    
    for message in mymail:
        date = message['date']
        to = message['to']
        sender = message['from']
        subject = message['subject']
        messageID = message['Message-ID']
        received = message['received']
        deliveredTo = message['Delivered-To']
        if messageID is None:   # skip entries without a Message-ID
            continue
    
        print("Date        :", date)
        print("From        :", sender)
        print("To:         :", to)
        print('Delivered-To:', deliveredTo)
        print("Subject     :", subject)
        print("Message-ID  :", messageID)
    #     print('Received    :', received)
    
        print("**************************************")
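
On the original point about listing the candidate keys: the header names live on each message object, so message.keys() and message.items() list them, whereas mymail.keys() returns the mailbox's own message keys. A minimal sketch, using an invented sample message in place of a real mbox entry:

```python
from email import message_from_string
from email.policy import default

# Invented sample message; addresses and header values are placeholders.
RAW = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: hello\n"
    "X-Custom: 1\n"
    "\n"
    "body\n"
)

message = message_from_string(RAW, policy=default)

# keys() lists the header names in file order (repeated headers appear repeatedly).
names = message.keys()
print(names)

# items() pairs each header name with its value.
for name, value in message.items():
    print(f"{name}: {value}")

# Lookups by name are case-insensitive.
same = message['subject'] == message['Subject']
```

Messages read from mailbox.mbox offer the same methods, so a loop over message.keys() inside the for loop above should show which fields each email actually carries.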