python html mbox

Write html file from mbox


Before Yahoo Groups closed, you could download the content of a group to an mbox file. I am trying to convert the mbox file into a series of HTML files, one for each message. My problem is dealing with the encoding and special characters in the HTML. Here is my attempt:

import mailbox

the_dir = "/path/to/file"

mbox = mailbox.mbox(the_dir + "12394334.mbox")

html_header = """<!DOCTYPE html>
<html>
<head>
<title>Email message</title>
</head>
<body>"""    
html_footer = '</body></html>'

for message in mbox:
    mess_from = message['from']
    subject = message['subject']
    time_received = message['date']
    if message.is_multipart():
        content = ''.join(str(part.get_payload(decode=True)) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    
    content = str(content)[2:].replace('\\n', '<br/>')
    subject.replace('/', '-')
    fname = subject + " " + time_received + '.html'
        
    with open(the_dir + 'html/' + fname , 'w') as the_file:
        the_file.write(html_header)
        the_file.write('<br/>' + 'From: ' + mess_from)
        the_file.write('<br/>' + 'Subject: ' + subject)
        the_file.write('<br/>' + 'Received: ' + time_received + '<br/><br/>')
        the_file.write(content)

The content of the message has backslashes before apostrophes and other special characters like this:

star rating, currently going for \xa311.99 [ideal Xmas present]. Advert over - Seroiusly, if you don't have a decent book on small boat

My question is: what is the best way to get the email message content and write it to the HTML file with the correct characters? I can't be the first one to run into this problem.
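
For reference, the escapes in the sample look like what Python produces when str() is applied to a bytes object, which is what get_payload(decode=True) returns. A minimal sketch of the effect, using a made-up byte string and assuming the body is ISO-8859-1 (where byte 0xA3 is the pound sign):

```python
# Hypothetical payload bytes; the ISO-8859-1 charset is an assumption for illustration
raw = b"currently going for \xa311.99"

# What the attempt above does: take the repr text of the bytes and drop the leading b'
as_repr = str(raw)[2:]

# Decoding the bytes with their charset yields real characters instead
as_text = raw.decode("iso-8859-1")

print(as_repr)   # still contains the literal characters \xa3 and a trailing quote
print(as_text)   # contains the pound sign
```

So the bytes need to be decoded with the right charset rather than converted with str().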


Solution

  • I found the answer to this question.

    First, I needed to identify HTML parts by their subtype (part.get_content_subtype()). A part whose subtype is 'html' holds HTML content.

    Then I needed to get the character set using part.get_charsets(). There is also a part.get_charset(), but it always returned None for these messages, so I take the first element of get_charsets(). (part.get_content_charset() reads the same value from the Content-Type header.)

    The decode=True parameter of get_payload is confusingly named: it only undoes the Content-Transfer-Encoding (base64 or quoted-printable) and returns bytes, without converting them to text. I then decode those bytes using the charset I got earlier. If no charset is declared, I take the payload with decode=False instead.

    If the message is plain text, I convert the line breaks to <br/>, add an HTML header, and then write it to the file.

    Next jobs,

    import html
    import mailbox
    import os

    the_dir = "/path/to/mbox/"

    mbox = mailbox.mbox(the_dir + "12394334.mbox")
    os.makedirs(the_dir + "html", exist_ok=True)    # make sure the output folder exists

    html_footer = "</body></html>"


    def part_text(part):
        """Return the body of a non-container part as a str."""
        enc = part.get_charsets()
        if enc[0] is not None:
            # decode=True undoes base64 / quoted-printable and returns bytes;
            # the declared charset then turns those bytes into text.
            return part.get_payload(decode=True).decode(enc[0], errors='replace')
        # No charset declared: take the payload as the parser left it
        return part.get_payload(decode=False)


    for message in mbox:
        # Headers can be missing, so fall back to an empty string
        mess_from = str(message['from'] or '')
        subject = str(message['subject'] or '')
        time_received = str(message['date'] or '')
        fname = subject + " " + time_received
        fname = fname.replace('/', '-')

        contents_text = []
        contents_html = []
        # walk() also yields a non-multipart message itself, so one loop covers both cases
        for part in message.walk():
            maintype = part.get_content_maintype()
            subtype = part.get_content_subtype()
            if maintype == 'multipart' or maintype == 'message':
                # Reject containers
                continue
            if maintype == 'text' and subtype == 'html':
                contents_html.append(part_text(part))
            elif maintype == 'text' and subtype == 'plain':
                contents_text.append(part_text(part))
            else:       # I will use this to process attachments in the future
                continue

        if len(contents_html) > 0:
            if len(contents_html) > 1:
                print('multiple html')      # This hasn't happened yet
            content = '\n\n'.join(contents_html)
        else:
            # Plain text: escape <, > and &, then turn line breaks into <br/>
            text = html.escape('\n\n'.join(contents_text))
            text = text.replace('\r\n', '\n').replace('\n', '<br/>')
            html_header = f"""<!DOCTYPE html>
    <html>
    <head>
    <meta charset="utf-8">
    <title>{html.escape(fname)}</title>
    </head>
    <body>"""
            # Escape the headers too: "Name <address>" would otherwise be read as a tag
            content = (html_header + '<br/>' +
                       'From: ' + html.escape(mess_from) + '<br/>' +
                       'Subject: ' + html.escape(subject) + '<br/>' +
                       'Received: ' + html.escape(time_received) + '<br/><br/>' +
                       text + html_footer)

        with open(the_dir + "html/" + fname + ".html", "w", encoding="utf-8") as the_file:
            the_file.write(content)

    print('Done!')
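
To see how the calls mentioned in the answer behave on a single part, here is a minimal sketch. The message is hand-built for illustration (the address, subject, and body are made up), and it assumes an ISO-8859-1 body sent as quoted-printable; real messages in the archive may differ.

```python
import email

# Hypothetical single-part message: ISO-8859-1 text, quoted-printable encoded.
# "=A3" is the quoted-printable form of byte 0xA3, the pound sign in ISO-8859-1.
raw = (
    b"From: someone@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"going for =A311.99\n"
)
msg = email.message_from_bytes(raw)

# get_charset() reports a Charset object set via set_charset(); parsing does not set one
print(msg.get_charset())            # None

# These read the charset parameter of the Content-Type header
print(msg.get_content_charset())    # iso-8859-1
print(msg.get_charsets())           # ['iso-8859-1']

# decode=False: the payload still carries the quoted-printable escapes
undecoded = msg.get_payload(decode=False)

# decode=True: transfer encoding undone, result is bytes, not text
body_bytes = msg.get_payload(decode=True)

# The charset turns the bytes into text; "ascii" is an assumed fallback
body = body_bytes.decode(msg.get_content_charset() or "ascii", errors="replace")
print(body)
```

Getting readable text therefore takes two steps: get_payload(decode=True) to undo the transfer encoding, then bytes.decode() with the part's charset.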