python csv email mbox

Python 3.6 Mbox to CSV


I'm trying to write a script that converts each email in an .mbox file into a row of a .csv file. I specifically need the following fields, but if there were a way to write every field of each message, that would be preferred:

To, From, CC'd, BCC'd, Date, Subject, Body

I found a script online that looks like the start of what I need, along with the documentation for the email module, but I can't find any specifics on how to

  1. identify the different attribute options (To, From, CC, etc.)
  2. write them as unique cell values in a .csv.

Here is sample code I found:

import mailbox
import csv

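# In Python 3 the csv module needs a text-mode file opened with newline=''
with open("clean_mail_B.csv", "w", newline="", encoding="utf-8") as f_output:
    writer = csv.writer(f_output)
    for message in mailbox.mbox('Saks.mbox'):
        writer.writerow([message['to'], message['from'], message['date']])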

Solution

  • To do that, you first need to determine the complete set of header keys present across all messages in the mailbox. You can then use that set to write the CSV header.

    Next, get all the key-value pairs from each message using .items(). These can be converted into a dictionary and written to your CSV file.

    The mailbox library unfortunately does not directly expose the message dictionary; otherwise it would have been possible to write it directly.
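
As a small illustration of what .keys() and .items() return for a single message, here is a sketch that builds one message in memory with email.message_from_string. The addresses and header values are made up for the example; messages read from an mbox file expose the same methods.

```python
import email

# Hypothetical sample message, built in memory for illustration only
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Cc: carol@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)
message = email.message_from_string(raw)

# keys() lists the header names present in this message
header_names = message.keys()

# items() returns (name, value) pairs; dict() keeps only the last value
# when a header such as Received appears more than once
headers = dict(message.items())

print(header_names)
print(headers['Subject'])

# Lookup on the message itself is case-insensitive, and a missing
# header gives None rather than raising KeyError
print(message['to'], message['bcc'])
```

Because a missing header simply returns None, a fixed column list such as To, From, Cc, Bcc, Date and Subject can also be written with a plain csv.writer if you do not need every header.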

    import mailbox
    import csv
    
    mbox_file = 'sample.mbox'
    
    with open('clean_mail_B.csv', 'w', newline='', encoding='utf-8') as f_output:
        # Create a column for the first 30 message payload sections
        fieldnames = {f'Part{part:02}' for part in range(1, 31)}
    
        for message in mailbox.mbox(mbox_file):
            fieldnames.update(message.keys())
    
        csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames), restval='', extrasaction='ignore')
        csv_output.writeheader()
    
        for message in mailbox.mbox(mbox_file):
            items = dict(message.items())
    
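            # get_payload() returns a list of parts only for multipart
            # messages; otherwise it returns the body as a single string
            payload = message.get_payload()
            parts = payload if message.is_multipart() else [payload]
    
            for part, payload in enumerate(parts, start=1):
                items[f'Part{part:02}'] = payload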
    
            csv_output.writerow(items)
    

    A DictWriter is used rather than a standard CSV writer. It copes better when certain messages do not contain all possible header values.
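
To show how restval and extrasaction behave, here is a minimal sketch that writes two made-up rows to an in-memory buffer instead of a file. The row contents and the X-Extra key are hypothetical stand-ins for message headers.

```python
import csv
import io

# Hypothetical rows standing in for two messages with different headers
rows = [
    {'From': 'alice@example.com', 'To': 'bob@example.com', 'Cc': 'carol@example.com'},
    {'From': 'dave@example.com', 'To': 'erin@example.com', 'X-Extra': 'no such column'},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=['Cc', 'From', 'To'],
    restval='',             # value used for columns missing from a row
    extrasaction='ignore',  # silently drop keys that are not in fieldnames
)
writer.writeheader()
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

The second row has no Cc value, so that cell is left empty, and its X-Extra key is dropped because it is not listed in fieldnames.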

    The message payload can be in multiple parts; these are added as separate columns, e.g. Part01, Part02. Normally there should be 1 or 2, but your sample mbox contained a message with a strange signature that had 25.

    If a message contains more than 30 payload parts, the extra parts are ignored using extrasaction='ignore'. An alternative approach would be to combine all payloads into a single column.
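
The single-column alternative could look roughly like the sketch below. The helper name combined_body and the sample message are hypothetical, and it assumes you only want the text/plain parts and that each part declares a charset Python recognises (an unknown charset would raise LookupError here).

```python
import email

def combined_body(message):
    """Join the text/plain parts of a message into one string."""
    texts = []
    for part in message.walk():
        # Skip container parts and anything that is not plain text
        if part.is_multipart() or part.get_content_type() != 'text/plain':
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or 'utf-8'
        texts.append(payload.decode(charset, errors='replace'))
    return '\n'.join(texts)

# Hypothetical two-part message, built in memory for illustration only
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Plain body\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<p>HTML body</p>\n"
    "--XYZ--\n"
)
message = email.message_from_string(raw)
body = combined_body(message)
print(body)
```

In the script above, this would mean adding a single 'Body' entry to fieldnames in place of the Part01 to Part30 columns and setting items['Body'] = combined_body(message) for each message.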