I am trying to parse the feed at "https://tre.tbe.taleo.net/tre01/ats/servlet/Rss?org=arobpers2&cws=42" with feedparser, but I get the error "UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed". I looked at other questions about UnicodeEncodeError, but this case seems different because chardet reports that the text is ASCII-encoded.
import chardet
import feedparser
import requests
url = "https://tre.tbe.taleo.net/tre01/ats/servlet/Rss?org=arobpers2&cws=42"
r = requests.get(url)
print(chardet.detect(r.text.encode()))  # reports ASCII
feed = feedparser.parse(r.text)  # raises UnicodeEncodeError
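The two observations are not necessarily contradictory. A minimal offline sketch, assuming (not verified against the live feed) that the feed writes an emoji as a pair of numeric character references: the markup is then pure ASCII, yet decoding one reference on its own yields a lone surrogate that UTF-8 cannot encode. The `sample` string is a made-up stand-in, not actual feed content.

```python
# Hypothetical stand-in for a feed title; the real feed content is not reproduced here.
sample = "<title>&#55356;&#57152; Now hiring</title>"

# The markup itself is pure ASCII, which is all a byte-level detector can see.
print(sample.isascii())

# Decoding only the first reference yields a lone high surrogate (U+D83C),
# which the UTF-8 codec refuses to encode.
lone = chr(55356)
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)
```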
I was able to solve the issue using html.unescape():
import feedparser
import html
import requests
url = "https://tre.tbe.taleo.net/tre01/ats/servlet/Rss?org=arobpers2&cws=42"
r = requests.get(url)
txt = html.unescape(r.text)
feed = feedparser.parse(txt)
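As far as I can tell, this works because html.unescape() replaces numeric character references in the surrogate range (U+D800 to U+DFFF) with U+FFFD, so no surrogate ever reaches the parser. The trade-off is that the emoji is lost, and references such as &amp; are decoded as well, which may change the XML. The sketch below shows that behaviour next to a possible alternative that keeps the emoji. `merge_surrogate_refs` is a hypothetical helper (not part of feedparser or html), and it assumes the feed writes emoji as pairs of numeric references; the `sample` string is made up.

```python
import html
import re

# Matches decimal (&#55356;) and hexadecimal (&#xD83C;) numeric references.
_NUM_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def _expand_surrogate_ref(match):
    """Turn a surrogate-range reference into the raw surrogate; keep others as-is."""
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if 0xD800 <= code <= 0xDFFF:
        return chr(code)
    return match.group(0)


def merge_surrogate_refs(text):
    """Hypothetical helper: combine surrogate-pair references into real characters."""
    expanded = _NUM_REF.sub(_expand_surrogate_ref, text)
    # A UTF-16 round trip joins adjacent high/low surrogates into one code point;
    # any unpaired surrogate is replaced with U+FFFD instead of raising.
    return expanded.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# Hypothetical stand-in for feed content.
sample = "<title>&#55356;&#57152; R&amp;D</title>"

unescaped = html.unescape(sample)      # emoji replaced, &amp; decoded
merged = merge_surrogate_refs(sample)  # emoji kept, &amp; left alone

print(unescaped.encode("utf-8"))
print(merged.encode("utf-8"))
```

With this helper, the call would become feedparser.parse(merge_surrogate_refs(r.text)); I have not tested that against the live feed.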