python xml lxml cdata

Parsing an XML document that includes another XML document embedded in a CDATA section


I'm trying out web scraping for the first time using lxml.etree. The website I want to scrape has an XML feed, which I can read fine, except for a part of its XML which is embedded within a CDATA section:

from lxml import etree

parser = etree.XMLParser(recover=True)

data=b'''<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry>
    <summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <REMITUrgentMarketMessages>
            <UMM>
                <messageId>2023-86___________________001</messageId>
                <event>
                    <eventStatus>Active</eventStatus>
                    <eventType>Other unavailability</eventType>
                    <eventStart>2023-09-07T06:00:00.000+02:00</eventStart>
                    <eventStop>2023-09-10T06:00:00.000+02:00</eventStop>
                </event>
                <unavailabilityType>Planned</unavailabilityType>
                <publicationDateTime>2022-10-06T13:42:00.000+02:00</publicationDateTime>
                <capacity>
                    <unitMeasure>mcm/d</unitMeasure>
                    <unavailableCapacity>9.0</unavailableCapacity>
                    <availableCapacity>0.0</availableCapacity>
                    <technicalCapacity>9.0</technicalCapacity>
                </capacity>
                <unavailabilityReason>Yearly maintenance</unavailabilityReason>
                <remarks>Uncertain duration</remarks>
                <balancingZone>21Y000000000024I</balancingZone>
                <balancingZone>21Y0000000001278</balancingZone>
                <balancingZone>21YGB-UKGASGRIDW</balancingZone>
                <balancingZone>21YNL----TTF---1</balancingZone>
                <balancingZone>37Y701125MH0000I</balancingZone>
                <balancingZone>37Y701133MH0000P</balancingZone>
                <affectedAsset>
                    <ns2:name>Dvalin</ns2:name>
                </affectedAsset>
                <marketParticipant>
                    <ns2:name>Gassco AS</ns2:name>
                    <ns2:eic>21X-NO-A-A0A0A-2</ns2:eic>
                </marketParticipant>
            </UMM>
        </REMITUrgentMarketMessages>]]></summary>
  </entry>
</feed>
'''

tree = etree.fromstring(data)
block = tree.xpath("/feed/entry/summary")[0]

block_str = "b'''"+block.text+"'''"

tree_in_tree = etree.fromstring(block_str)

The problem is that the XML in the CDATA section is oddly indented, so when I put the CDATA content into a string and parse it with etree (as in the code above), I get an error that I assume is caused by the indentation.

This is the message:

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

As far as I can tell, the problem is the indentation between the first line of the CDATA section and REMITUrgentMarketMessages.

Does anyone know how to fix this? :)

Thanks for the help!


Solution

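  • The b prefix only has meaning in source code, where it marks a bytes literal; block.text is a str value, not a literal. Concatenating "b'''" onto it produces a str whose first character is b, which is why the parser reports that '<' was not found at line 1, column 1. The indentation is not the cause. A plain str will not work either, because lxml refuses to parse a str that carries an encoding declaration. Instead, create the bytes object (representing the embedded XML document) using bytes():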

    block_str = bytes(block.text, "UTF-8")
    

    When the program is run now, it fails with a different error:

    lxml.etree.XMLSyntaxError: Namespace prefix ns2 on name is not defined

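    The embedded document uses the ns2 prefix without ever declaring it, so it is not namespace-well-formed. That is a genuine error in the data, but it can be bypassed by passing the parser that was configured with recover=True at the top of the script: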

    tree_in_tree = etree.fromstring(block_str, parser)
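To see why the original concatenation failed, here is a small sketch of the first point. It deliberately uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages. The feed is a shortened stand-in (an assumption, not the real data), and the elements with the undeclared ns2 prefix are left out because ElementTree has no counterpart to lxml's recover=True. The exact error message also differs from lxml's.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the feed. The ns2 elements are omitted
# because ElementTree cannot recover from an undeclared namespace prefix.
feed = b'''<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry>
    <summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <REMITUrgentMarketMessages>
            <UMM>
                <eventStatus>Active</eventStatus>
            </UMM>
        </REMITUrgentMarketMessages>]]></summary>
  </entry>
</feed>
'''

outer = ET.fromstring(feed)
inner_text = outer.find("entry/summary").text  # CDATA content, returned as a str

# Concatenation does not create a bytes literal. It creates a str that
# begins with the character "b", so the parser never sees "<" first.
wrapped = "b'''" + inner_text + "'''"
first_char = wrapped[0]

try:
    ET.fromstring(wrapped)
    wrapped_parses = True
except ET.ParseError:
    wrapped_parses = False

# Encoding the str yields real bytes that start with the XML declaration.
inner_bytes = inner_text.encode("UTF-8")
inner = ET.fromstring(inner_bytes)
status = inner.findtext("UMM/eventStatus")

print(first_char, wrapped_parses, inner.tag, status)
```

inner_text.encode("UTF-8") is equivalent to bytes(inner_text, "UTF-8"); either form gives the parser real bytes. The leading whitespace before REMITUrgentMarketMessages is harmless, since whitespace after the XML declaration is allowed.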