python xml rss decode feed

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed


I am trying to parse "https://tre.tbe.taleo.net/tre01/ats/servlet/Rss?org=arobpers2&cws=42" with feedparser, but I get the error "UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed". I looked at other questions about UnicodeEncodeError, but this case seems different: chardet reports that the text is ASCII-encoded.

import chardet
import feedparser
import requests

url = "https://tre.tbe.taleo.net/tre01/ats/servlet/Rss?org=arobpers2&cws=42"
r = requests.get(url)
print(chardet.detect(r.text.encode()))  # Outputs ASCII
feed = feedparser.parse(r.text)  # Raises UnicodeEncodeError

Solution

  • I was able to solve the issue using html.unescape():

    import feedparser
    import html
    import requests

    url = "https://tre.tbe.taleo.net/tre01/ats/servlet/Rss?org=arobpers2&cws=42"
    r = requests.get(url)
    # Convert HTML character references to characters before parsing
    txt = html.unescape(r.text)
    feed = feedparser.parse(txt)
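A likely explanation, judging only from the error message: the feed seems to write emoji as two decimal character references, one per UTF-16 surrogate half (for example `&#55356;&#57222;`). That would also explain why chardet sees plain ASCII. The sketch below uses a made-up `sample` string, not the real feed, and a hypothetical helper `decode_surrogate_refs`. It shows that `html.unescape()` avoids the error by replacing each surrogate reference with U+FFFD, so the emoji is lost, and it shows one possible way to keep the emoji while leaving every other reference untouched.

```python
import html
import re

# Made-up sample: an emoji written as two decimal references,
# one per surrogate half (U+D83C = 55356, U+DF86 = 57222).
sample = "<title>Party &#55356;&#57222;</title>"

# A lone surrogate cannot be encoded as UTF-8; this is the error from the question.
try:
    "\ud83c".encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc)

# html.unescape() turns each surrogate reference into U+FFFD (replacement character).
print(ascii(html.unescape(sample)))

_DECIMAL_REF = re.compile(r"&#(\d+);")  # decimal references only; hex (&#x...;) is not handled


def decode_surrogate_refs(text):
    """Hypothetical helper: join surrogate-pair references into real characters."""

    def to_surrogate(match):
        code = int(match.group(1))
        # Only touch surrogate halves; leave every other reference as it is.
        return chr(code) if 0xD800 <= code <= 0xDFFF else match.group(0)

    with_surrogates = _DECIMAL_REF.sub(to_surrogate, text)
    # A UTF-16 round trip merges each high/low pair into one code point;
    # an unpaired half becomes U+FFFD instead of raising.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


print(ascii(decode_surrogate_refs(sample)))
```

Note that `html.unescape()` also decodes references such as `&amp;` and `&lt;`, which may change the XML markup itself. The helper above leaves those alone, so its output could be passed to `feedparser.parse()` in place of `txt`. I have not checked this against the actual feed.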