I'm trying to write a script that will convert each email element of an .mbox file into a .csv file. I specifically need the following elements, but if there were a way to "write for each element," that'd be preferred:
To, From, CC'd, BCC'd, Date, Subject, Body
I found a script online that looks to be the start of what I need, and the documentation about the email module, but I can't find any specifics on how to write the different elements (to, from, cc'd, etc.) to the .csv. Here is sample code I found:
import mailbox
import csv
writer = csv.writer(open("clean_mail_B.csv", "wb"))
for message in mailbox.mbox('Saks.mbox'):
    writer.writerow([message['to'], message['from'], message['date']])
To do that you would first need to determine the complete list of possible keys present in all mailbox items. Then you can use that to write the CSV header.
Next you need to get all the key-value pairs from each message using .items(). These can then be converted back into a dictionary and written to your CSV file.
The mailbox library unfortunately does not directly expose the message dictionary; otherwise it would have been possible to write this directly.
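As a minimal sketch of those two steps, the snippet below parses two made-up messages with email.message_from_string() (the addresses and header values are invented for illustration, not taken from your mbox), collects the union of their header names, and turns each message's .items() into a dictionary. Messages read from mailbox.mbox() expose the same keys() and items() methods.

```python
import email

# Two hypothetical messages with different sets of headers
raw_messages = [
    "From: alice@example.com\nTo: bob@example.com\nSubject: Hello\n\nBody one\n",
    "From: carol@example.com\nCc: dave@example.com\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n\nBody two\n",
]
messages = [email.message_from_string(raw) for raw in raw_messages]

# Step 1: union of every header name seen across all messages
fieldnames = set()
for message in messages:
    fieldnames.update(message.keys())

# Step 2: .items() yields (header, value) pairs; dict() turns them into a row
rows = [dict(message.items()) for message in messages]

print(sorted(fieldnames))
print(rows[0])
```

Note that dict() keeps only the last value when a header appears more than once in a message (Received is a common case), so repeated headers would need separate handling if they matter to you.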
import mailbox
import csv

mbox_file = 'sample.mbox'

with open('clean_mail_B.csv', 'w', newline='', encoding='utf-8') as f_output:
    # Create a column for the first 30 message payload sections
    fieldnames = {f'Part{part:02}' for part in range(1, 31)}

    # First pass: collect every header name used by any message
    for message in mailbox.mbox(mbox_file):
        fieldnames.update(message.keys())

    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames), restval='', extrasaction='ignore')
    csv_output.writeheader()

    # Second pass: write one row per message
    for message in mailbox.mbox(mbox_file):
        items = dict(message.items())

        # get_payload() returns a list of parts only for multipart messages;
        # otherwise it returns a single string, which is wrapped in a list here
        payloads = message.get_payload() if message.is_multipart() else [message.get_payload()]

        for part, payload in enumerate(payloads, start=1):
            items[f'Part{part:02}'] = payload

        csv_output.writerow(items)
A DictWriter is used rather than a standard CSV writer, as it copes better when certain messages do not contain all possible header values.
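To illustrate how a DictWriter handles rows with missing keys, here is a small sketch that writes to an in-memory buffer instead of a file. The field names and row values are invented for the example; restval='' is what fills the empty cells.

```python
import csv
import io

# Hypothetical header names and rows; the second row has no Subject or To
fieldnames = ['Cc', 'From', 'Subject', 'To']
rows = [
    {'From': 'alice@example.com', 'To': 'bob@example.com', 'Subject': 'Hello'},
    {'From': 'carol@example.com', 'Cc': 'dave@example.com'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(rows)  # missing keys are written as restval

output = buffer.getvalue()
print(output)
```

With a plain csv.writer you would have to build each row as a list in the right column order and pad the gaps yourself.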
The message payload can be in multiple parts; these are added as separate columns, e.g. Part01, Part02. Normally there should be 1 or 2, but your sample mbox contained one message with a strange signature that had 25.
If a message in the mbox contains more payload entries than this (i.e. >30), the extra entries are ignored using extrasaction='ignore'. An alternative approach would be to combine all payloads into a single column.
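One possible way to do the single-column variant is sketched below. The helper name get_body is made up for this example, and it assumes you only want the text/plain parts (attachments and HTML parts are skipped) and that each part declares a charset Python recognises, falling back to UTF-8 when none is given. The sample message is invented.

```python
import email

def get_body(message):
    """Join every text/plain part of a message into one string."""
    chunks = []
    for part in message.walk():
        # Multipart containers only hold other parts, so skip them
        if part.is_multipart():
            continue
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            charset = part.get_content_charset() or 'utf-8'
            chunks.append(payload.decode(charset, errors='replace'))
    return '\n'.join(chunks)

# Hypothetical two-part message used to try the helper out
raw = (
    "From: alice@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "First part\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Second part\n"
    "--XYZ--\n"
)
body = get_body(email.message_from_string(raw))
print(body)
```

In the script above you would then add a single 'Body' entry to fieldnames and set items['Body'] = get_body(message) in place of the PartNN loop.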