
Read a big .mbox file with Python


I'd like to read a big 3 GB .mbox file from a Gmail backup. This works:

import mailbox
mbox = mailbox.mbox(r"D:\All mail Including Spam and Trash.mbox")
for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        # decode=True returns bytes (or None for nested multipart parts), so join as bytes
        content = b''.join(part.get_payload(decode=True) or b'' for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

except that it takes more than 40 seconds just for the first 10 messages.

Is there a faster way to access a big .mbox file with Python?


Solution

  • Here's a quick and dirty attempt at a generator that reads an mbox file message by message. I have opted to simply discard the information from the From separator line; the real mailbox library probably provides more information, and of course, this only supports reading, not searching or writing back to the input file.

    #!/usr/bin/env python3
    
    import email
    from email.policy import default
    
    class MboxReader:
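        def __init__(self, filename):
            self.handle = open(filename, 'rb')
            # Consume the first separator line; unlike assert, this check also runs under -O
            if not self.handle.readline().startswith(b'From '):
                raise ValueError('not an mbox file: missing initial "From " line')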
    
        def __enter__(self):
            return self
    
        def __exit__(self, exc_type, exc_value, exc_traceback):
            self.handle.close()
    
        def __iter__(self):
            # Generator: collect lines until the next "From " separator or EOF,
            # then parse and yield the accumulated message.
            lines = []
            while True:
                line = self.handle.readline()
                if line == b'' or line.startswith(b'From '):
                    yield email.message_from_bytes(b''.join(lines), policy=default)
                    if line == b'':
                        break
                    lines = []
                    continue
                lines.append(line)
    

    Usage:

    with MboxReader(mboxfilename) as mbox:
        for message in mbox:
            print(message.as_string())
    

    The policy=default argument (or any policy instead of default if you prefer, of course) selects the modern EmailMessage API, which was introduced provisionally in Python 3.3 and became official in 3.6. If you need to support older Python versions from simpler times, you will want to omit it; but really, the new API is better in many ways.
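
As a rough illustration of what the newer API makes easier, here is a minimal sketch of pulling out the sender, subject, and plain-text body from a message parsed with policy=default. The helper name summarize and the sample message are made up for this example; get_body() and get_content() are methods of EmailMessage and are not available on the legacy Message objects you get without a modern policy.

```python
from email import message_from_bytes
from email.policy import default


def summarize(message):
    """Return (sender, subject, text) for a message parsed with a modern policy.

    Hypothetical helper for illustration; not part of the MboxReader class.
    """
    # get_body() searches the MIME structure for the preferred part and
    # returns None when there is no text/plain part at all.
    body = message.get_body(preferencelist=("plain",))
    # get_content() decodes the transfer encoding and charset, returning str.
    text = body.get_content() if body is not None else ""
    return message["from"], message["subject"], text


# A tiny made-up multipart message, standing in for one yielded by the reader.
raw = (
    b"From: Alice <alice@example.com>\r\n"
    b"Subject: Hello\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="sep"\r\n'
    b"\r\n"
    b"--sep\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Hi there\r\n"
    b"--sep\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"\r\n"
    b"<p>Hi there</p>\r\n"
    b"--sep--\r\n"
)

sender, subject, text = summarize(message_from_bytes(raw, policy=default))
print("from   :", sender)
print("subject:", subject)
print("content:", text.strip())
```

Inside the loop from the usage example, the same helper could be called on each message instead of printing message.as_string(); this avoids the manual is_multipart() / get_payload(decode=True) handling from the question.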