I have a requirement to extract XML that is embedded in a CDATA section inside another XML document. I am able to extract the outer XML tags, but not the XML tags inside the CDATA.
I need to extract
Below is the XML sample I am working with.
<B2B_DATA>
<B2B_METADATA>
<EventId>122157660</EventId>
<MessageType>Request</MessageType>
</B2B_METADATA>
<PAYLOAD>
<![CDATA[<?xml version="1.0"?>
<REQUEST_GROUP MISMOVersionID="1.1.1">
<REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St" _City="rockwall" _State="MD" _PostalCode="11311" _Identifier="416">
<CONTACT_DETAIL _Name="XX Davis">
<CONTACT_POINT _Type="Phone" _Value="1236573348"/>
<CONTACT_POINT _Type="Email" _Value="jXX@city.com"/>
</CONTACT_DETAIL>
</REQUESTING_PARTY>
</REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>
I have tried this -
from xml.etree import ElementTree

tree = ElementTree.parse('file.xml')
root = tree.getroot()
# Print the tag of each direct child of the root element
for child in root:
    print(child.tag)
Output:
B2B_METADATA
PAYLOAD
I am not able to parse the XML inside PAYLOAD.
Any help is greatly appreciated.
What you need to do in this case is parse the outer XML, extract the CDATA content of PAYLOAD as a string, parse that string as a second XML document, and extract the target data from it.
I personally would use lxml and xpath, not ElementTree:
from lxml import etree

root = etree.parse('file.xml')

# Step 1: extract the CDATA content as a string
cd = root.xpath('//PAYLOAD//text()')[0].strip()

# Step 2: parse the CDATA string as XML
doc = etree.XML(cd)

# Step 3: extract the target data
doc.xpath('//REQUESTING_PARTY//CONTACT_POINT[@_Type="Phone"]/@_Value')[0]
Output, based on your sample xml above:
'1236573348'
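If you would rather stay with the standard library, the same two-step approach should also work with ElementTree. This is only a minimal sketch, not the definitive way to do it: the sample is inlined (and trimmed) as OUTER_XML so it runs on its own, and the helper name extract_contact_points is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, inlined so the sketch is
# self-contained (assumption: a real script would read file.xml instead).
OUTER_XML = """<B2B_DATA>
<B2B_METADATA>
<EventId>122157660</EventId>
<MessageType>Request</MessageType>
</B2B_METADATA>
<PAYLOAD>
<![CDATA[<?xml version="1.0"?>
<REQUEST_GROUP MISMOVersionID="1.1.1">
<REQUESTING_PARTY _Name="CityBank" _Identifier="416">
<CONTACT_DETAIL _Name="XX Davis">
<CONTACT_POINT _Type="Phone" _Value="1236573348"/>
<CONTACT_POINT _Type="Email" _Value="jXX@city.com"/>
</CONTACT_DETAIL>
</REQUESTING_PARTY>
</REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>"""


def extract_contact_points(outer_xml):
    """Hypothetical helper: map each CONTACT_POINT _Type to its _Value."""
    outer = ET.fromstring(outer_xml)
    # The parser hands back CDATA as the plain text of PAYLOAD.
    # strip() matters: the inner XML declaration must be at the very start.
    inner_text = outer.findtext("PAYLOAD").strip()
    # Second parse: the CDATA string is an XML document of its own.
    inner = ET.fromstring(inner_text)
    return {cp.get("_Type"): cp.get("_Value")
            for cp in inner.iter("CONTACT_POINT")}


contacts = extract_contact_points(OUTER_XML)
print(contacts["Phone"])
```

The key point is the same as in the lxml version: the CDATA section is just text to the outer parser, so child elements only show up after that text is parsed a second time.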