I'm trying to parse the content below (which came from a Gmail email body) with feedparser by feeding it in as a raw text feed. I'll be displaying it on a page, so I want it to look like it did in the email body, and I'm using feedparser to help sanitize the content.
Original email body from Gmail, where each testing# (e.g. testing1, testing2, etc.) is on a new line:
<div dir="ltr">testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div><div> testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div><div> testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div></div><div> testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div></div><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Bob Johnson<br>(123) 456-7890</div></div></div></div>
I then feed it into an RSS feed like so:
<rss version="2.0">
<channel>
<title>Created Feed</title>
<item><title>test2</title>
<pubDate>Sun, 23 May 2021 21:38:03 -0400</pubDate>
<link>elysiumreader+yac41zdxipljzcfl@gmail.comSun23May20212138030400</link>
<content><div dir="ltr">testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div><div> testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div><div> testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div></div><div> testing1<div>testing2</div><div>testing3</div><div>testing4</div><div>testing5</div><div>testing6</div></div><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Bob Johnson<br>(123) 456-7890</div></div></div></div>
</content>
</channel>
</rss>
This ultimately gets parsed as shown below: it loses the line breaks between the lines, and eventually it looks like the <br> tag throws it off and it stops parsing properly.
{'title': 'test2', 'title_detail': {'type': 'text/plain', 'language': None, 'base': '', 'value': 'test2'}, 'published': 'Sun, 23 May 2021 21:38:03 -0400', 'published_parsed': time.struct_time(tm_year=2021, tm_mon=5, tm_mday=24, tm_hour=1, tm_min=38, tm_sec=3, tm_wday=0, tm_yday=144, tm_isdst=0), 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'www.test.com/blablabla'}], 'link': 'www.test.com/blablabla', 'content': [{'type': 'application/xhtml+xml', 'language': None, 'base': '', 'value': 'testing1testing2testing3testing4testing5testing6\xa0 testing1testing2testing3testing4testing5testing6\xa0 testing1testing2testing3testing4testing5testing6\xa0 testing1testing2testing3testing4testing5testing6<br /></div>-- <br /><div class="gmail_signature" dir="ltr"><div dir="ltr">Bob Johnson<br />(123) 456-7890</div></div></div></div>'}], 'summary': 'testing1testing2testing3testing4testing5testing6\xa0 testing1testing2testing3testing4testing5testing6\xa0 testing1testing2testing3testing4testing5testing6\xa0 testing1testing2testing3testing4testing5testing6<br /></div>-- <br /><div class="gmail_signature" dir="ltr"><div dir="ltr">Bob Johnson<br />(123) 456-7890</div></div></div></div>'}
Any thoughts or ideas would be appreciated. Thanks
RSS is basically an XML document, so it must follow fairly strict well-formedness rules. The HTML markup you're pulling from Gmail is breaking the RSS document: tags such as <br> are never closed, so the XML parser cannot make sense of the nesting. If you want the feed to include HTML markup, you need to encode the HTML entities or wrap the content in a CDATA section before you add it to the RSS feed.
To encode HTML entities using the standard library:
>>> from html import escape
>>> escape('<div dir="ltr">testing1<div>testing2')
'&lt;div dir=&quot;ltr&quot;&gt;testing1&lt;div&gt;testing2'
and for CDATA:
>>> body='<div dir="ltr">testing1<div>testing2'
>>> print(f"<![CDATA[{body}]]>")
<![CDATA[<div dir="ltr">testing1<div>testing2]]>
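One caveat with the CDATA approach: if the email body itself ever contains the sequence ]]>, it will close the section early. A minimal sketch of a wrapper that guards against this is below; `to_cdata` is a hypothetical helper name, not part of feedparser or the standard library, and the check uses the standard library's ElementTree rather than feedparser.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>'."""
    # A literal ']]>' would terminate the section early, so close the
    # section after ']]' and reopen a new one before the '>'.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# An XML parser joins the adjacent sections back into the original text.
element = ET.fromstring("<content>" + to_cdata("a]]>b<div>") + "</content>")
recovered = element.text
```

Bodies that never contain ]]> come out the same as with the plain f-string above.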
Both options only change how the markup is written inside the feed. Even though the result might look scary, the parser decodes it back to the original HTML, so it will display correctly to the end user.
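To see that both options round-trip, here is a small sketch that builds a cut-down version of the feed and reads the content back. It uses the standard library's ElementTree as a stand-in XML parser, so it only demonstrates the XML side of things, not feedparser's own sanitizing; `make_feed` and `content_of` are hypothetical helper names, and the body is a shortened version of the email markup.

```python
from html import escape
import xml.etree.ElementTree as ET

# Shortened sample of the Gmail markup, including the unclosed <br>.
body = '<div dir="ltr">testing1<div>testing2</div><br></div>'


def make_feed(content: str) -> str:
    """Build a minimal RSS document around already-prepared content."""
    return (
        '<rss version="2.0"><channel><title>Created Feed</title>'
        f"<item><title>test2</title><content>{content}</content></item>"
        "</channel></rss>"
    )


def content_of(feed: str) -> str:
    """Parse the feed and return the text of the first item's content."""
    return ET.fromstring(feed).find("channel/item/content").text


# Option 1: escape the markup. Option 2: wrap it in a CDATA section.
from_escaped = content_of(make_feed(escape(body)))
from_cdata = content_of(make_feed(f"<![CDATA[{body}]]>"))

# Embedding the raw markup is not well-formed XML and fails to parse.
try:
    content_of(make_feed(body))
    raw_parses = True
except ET.ParseError:
    raw_parses = False
```

In both working cases the parser hands back the original HTML string unchanged, which is what you would then sanitize and render on the page.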