python email parsing imap imaplib

Parsing the HTML content in email


I'm trying to write a Python script to read my emails. I can extract most fields correctly, such as To, From, and Subject. But for the body, I get the plain text as well as its HTML code.


Below is the part of the code that extracts the content from the email:

email_message = email.message_from_string(raw_email)
print 'To:', email_message['To']
print 'Sent from:', email_message['From']
print 'Date:', email_message['Date']
print 'Subject:', email_message['Subject']
print '*'*30, 'MESSAGE', '*'*30
maintype = email_message.get_content_maintype()
#print maintype

if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_maintype() == 'text':
            print part.get_payload()
elif maintype == 'text':
    print email_message.get_payload()
print '*'*69

Git link for the complete code: Email-parser

How to get rid of that HTML code and get only the plain text?


Solution

  • The body of the message is MIME-encoded, typically as a multipart/alternative message that carries the same text in both plaintext and HTML formats. That's why you see both. To get just the plaintext of the body, you first need to parse the MIME structure and decode only the text/plain part. You can use Python's email package to do this. Also, see this question for more information.

    import email
    import email.policy
    
    with open("example.email", "rb") as f:
        msg = email.message_from_bytes(f.read(), policy=email.policy.default)
    
    for part in msg.iter_parts():
        if part.get_content_type() == "text/plain":  # skip the HTML alternative
            print(part.get_content())  # get_content() decodes quoted-printable/base64
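
    For messages where the text parts are nested more deeply (for example, a multipart/mixed message with attachments), looping over the top-level parts may not be enough. A minimal sketch, assuming Python 3.6+ and the `email.policy.default` policy, is to let `get_body()` pick the preferred part and to strip tags only when the message has no plaintext alternative. The helper name `plain_text_body`, the `_TextExtractor` class, and the sample message are hypothetical and only there for illustration:

```python
import email
import email.policy
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text_body(raw_bytes):
    """Return the plain-text body of a raw message, or '' if it has no text body."""
    msg = email.message_from_bytes(raw_bytes, policy=email.policy.default)
    # Prefer the text/plain part; fall back to text/html if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    content = body.get_content()  # decodes the transfer encoding and charset
    if body.get_content_type() == "text/html":
        extractor = _TextExtractor()
        extractor.feed(content)
        extractor.close()
        content = "".join(extractor.chunks)
    return content.strip()


# Hypothetical sample: a multipart/alternative message built in memory.
sample = EmailMessage()
sample["To"] = "to@example.com"
sample["From"] = "from@example.com"
sample["Subject"] = "Hello"
sample.set_content("Hi there, this is the plain part.")
sample.add_alternative(
    "<html><body><p>Hi there, this is the <b>HTML</b> part.</p></body></html>",
    subtype="html",
)

print(plain_text_body(sample.as_bytes()))
```

    With imaplib, the raw bytes returned by `fetch` for the `(RFC822)` item can be passed to the helper in place of `sample.as_bytes()`. The tag stripping here is deliberately naive (it also keeps text inside `<style>` or `<script>` elements), so a dedicated HTML-to-text library may be a better fit for HTML-only mail.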