I'm trying to write a Python script to read my emails. I'm able to get most of the fields properly, like To, From, and Subject. But in the body, I get the plain text as well as its HTML code.
Below is the part of the code that extracts the content from the email:
email_message = email.message_from_string(raw_email)
print 'To:', email_message['To']
print 'Sent from:', email_message['From']
print 'Date:', email_message['Date']
print 'Subject:', email_message['Subject']
print '*'*30, 'MESSAGE', '*'*30
maintype = email_message.get_content_maintype()
#print maintype
if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_maintype() == 'text':
            print part.get_payload()
elif maintype == 'text':
    print email_message.get_payload()
print '*'*69
Git link for the complete code: Email-parser
How can I get rid of the HTML code and get only the plain text?
The body of the message is MIME-encoded as a multipart message, which is why it contains the text in both plain-text and HTML formats. To get just the plain text of the body, you first need to MIME-decode the message and then pick the text/plain part. You can use Python's email package to do the MIME-decoding. Also, see this question for more information.
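import email
import email.policy

with open("example.email", "rb") as f:
    msg = email.message_from_bytes(f.read(), policy=email.policy.default)

for part in msg.iter_parts():
    # Skip the HTML alternative and keep only the plain-text part
    if part.get_content_type() == "text/plain":
        print(part.get_content())  # decodes quoted-printable/base64 and the charset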
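If you would rather not loop over the parts yourself, a minimal sketch along the following lines may help. It relies on EmailMessage.get_body() from the standard library (Python 3.6+), which looks through the MIME tree for the preferred body part. The helper name extract_plain_text and the sample message are made up for illustration and are not part of the asker's code.

```python
import email
import email.policy
from email.message import EmailMessage


def extract_plain_text(raw_bytes):
    """Return the text/plain body of a raw message, or None if there is none.

    Hypothetical helper for illustration only.
    """
    msg = email.message_from_bytes(raw_bytes, policy=email.policy.default)
    # get_body() searches the MIME structure for a body candidate;
    # preferencelist=("plain",) restricts the search to text/plain parts.
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        return None
    # get_content() undoes the transfer encoding and charset and returns a str.
    return body.get_content()


# Build a small sample message (made-up content) with both alternatives,
# so the example does not depend on a file on disk.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Hello"
sample.set_content("Hello Bob, this is the plain text body.")
sample.add_alternative(
    "<html><body><p>Hello Bob, this is the <b>HTML</b> body.</p></body></html>",
    subtype="html",
)

plain_text = extract_plain_text(sample.as_bytes())
print(plain_text)
```

If a message only has an HTML part, this helper returns None, so you would then have to decide whether to strip the tags from the HTML yourself.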