[SOLVED] Parsing the HTML content in email

I'm trying to write a Python script to read my emails. I can extract most fields correctly, such as To, From, and Subject. But for the body, I get the plain text as well as its HTML code.

Below is the part of the code that extracts the content from the email:

email_message = email.message_from_string(raw_email)
print 'To:', email_message['To']
print 'Sent from:', email_message['From']
print 'Date:', email_message['Date']
print 'Subject:', email_message['Subject']
print '*'*30, 'MESSAGE', '*'*30
maintype = email_message.get_content_maintype()
#print maintype

if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_maintype() == 'text':
            print part.get_payload()
elif maintype == 'text':
    print email_message.get_payload()
print '*'*69

Git link for the complete code: Email-parser

How can I get rid of the HTML code and get only the plain text?

Solution

The body of the message is MIME-encoded, which is why it contains the text in both plain-text and HTML formats. To get just the plain text of the body, you first need to MIME-decode the message. You can use Python's email package to do the MIME decoding. Also, see this question for more information.

import email
import email.policy

with open("example.email", "rb") as f:
    msg = email.message_from_bytes(f.read(), policy=email.policy.default)

for part in msg.iter_parts():
    print(part.get_content())  # print each part's content, decoded from quoted-printable/base64
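
Note that the loop above prints every top-level part, so for a message that carries both alternatives you would still see the HTML. A minimal sketch of picking only the plain-text part, assuming Python 3.6+ and a message parsed with email.policy.default, is to ask get_body() for the 'plain' alternative. The names extract_plain_text and _TextExtractor are hypothetical helpers made up for this example, and the HTML fallback is a crude tag stripper (it keeps any script or style text), not a full HTML-to-text converter.

```python
import email
import email.policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_plain_text(raw_bytes):
    """Return the plain-text body of a raw email, or '' if none is found."""
    msg = email.message_from_bytes(raw_bytes, policy=email.policy.default)

    # Prefer the text/plain alternative when the sender included one.
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        return body.get_content()

    # Otherwise fall back to the text/html alternative and drop the tags.
    body = msg.get_body(preferencelist=("html",))
    if body is not None:
        parser = _TextExtractor()
        parser.feed(body.get_content())
        parser.close()  # flush any text still buffered by the parser
        return "".join(parser.chunks)

    return ""
```

With the file from the answer, this would be called as extract_plain_text(open("example.email", "rb").read()).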