python xml unicode decode feed

lxml.etree.XMLSyntaxError for Korean Characters


I am trying to parse https://api.lever.co/v0/postings/matchgroup?mode=xml, but I am getting the error lxml.etree.XMLSyntaxError: CData section not finished. The issue seems to be caused by the data containing Korean characters.

import io

import lxml.etree
import requests

url = "https://api.lever.co/v0/postings/matchgroup?mode=xml"
r = requests.get(url)
f = io.BytesIO(r.content)  # wrap the raw response bytes in a file-like object
parser = lxml.etree.XMLParser(recover=False)  # strict: fail on malformed XML
tree = lxml.etree.parse(f, parser)  # Raises lxml.etree.XMLSyntaxError

I can set recover to True, but then some of the entries would be missing.


Solution

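  • In this case the file is broken by a non-printable \x08 byte (backspace, ^H) in the data, rather than by the Korean characters themselves. XML 1.0 does not allow that control character, even inside a CDATA section, so a strict parser rejects the document.

    To fix it, strip the byte before parsing:

    f = io.BytesIO(r.content.replace(b"\x08", b""))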