I am trying to parse https://api.lever.co/v0/postings/matchgroup?mode=xml but I am getting the error lxml.etree.XMLSyntaxError: CData section not finished. It seems like the issue is caused by the data containing Korean characters.
import lxml.etree
import io
import requests
url = "https://api.lever.co/v0/postings/matchgroup?mode=xml"
r = requests.get(url)
f = io.BytesIO(r.content)
parser = lxml.etree.XMLParser(recover=False)
tree = lxml.etree.parse(f,parser) # Raises lxml.etree.XMLSyntaxError
I can change recover to True, but then some of the entries would be missing.
In this case the file is broken by a non-printable \x08 character (backspace, ^H), which XML 1.0 does not allow. To fix it, remove that byte before parsing:
f = io.BytesIO(r.content.replace(b"\x08", b""))
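If the feed ever contains other control characters, the same idea can be generalized. The following is a minimal sketch, not something taken from the feed or from lxml: the helper names strip_illegal_xml_bytes and find_illegal_xml_bytes are made up for illustration, and it assumes the response is in an ASCII-compatible encoding such as UTF-8, where bytes below 0x20 never occur inside a multi-byte character, so Korean text is left untouched.

```python
# Control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
ILLEGAL_XML_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def find_illegal_xml_bytes(data: bytes) -> list:
    """Return the offsets of control bytes that XML 1.0 forbids (for diagnosis)."""
    return [i for i, b in enumerate(data) if b in ILLEGAL_XML_BYTES]


def strip_illegal_xml_bytes(data: bytes) -> bytes:
    """Drop forbidden control bytes; assumes an ASCII-compatible encoding (e.g. UTF-8)."""
    return data.translate(None, ILLEGAL_XML_BYTES)


# Hypothetical sample standing in for r.content
sample = "<job><![CDATA[서울 \x08Engineer]]></job>".encode("utf-8")
cleaned = strip_illegal_xml_bytes(sample)
```

With the code from the question, this would be used as f = io.BytesIO(strip_illegal_xml_bytes(r.content)), and parsing with recover=False should then keep every entry, provided no other kind of damage is present.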