[SOLVED] pandas read_xml missing data

pandas read_xml missing data

I have tried using Pandas read_xml and it reads most of the XML fine but it leaves some parts out because its in a slightly different format. I have included an extract below and it reads "Type", "Activation" fine but doesn't for the "Amt" value. It picks up the column heading "Amt" just not the value. Could anyone point me in the right direction on how to get it to read it. Thanks

<Type>PYI</Type>
<Activation>N</Activation>
<Amt val="4000" curr="GBP"/>

xml_df = pd.read_xml(xml_data)

Anybody able to help I have tried going through the documentation for Pandas.read_xml but I can see why it wouldn't pick this up?

Solution

By default, pandas.read_xml parses all the immediate descendants of a set of nodes including its child nodes and attributes. Unless, the xpath argument indicates it, read_xml will not go further than immediate descendants.

To illustrate your use case. Below is likely the generalized set up of your XML where <Type> and its siblings, <Activation> and <Amt> are parsed. However, <Amt> does not contain a text node, only attributes. So the value in that column should be empty.

<root>
  <row>
    <Type>PYI</Type>              <!-- Type IS A CHILD NODE OF row -->
    <Activation>N</Activation>    <!-- Activation IS A CHILD NODE OF row -->
    <Amt val="4000" curr="GBP"/>  <!-- Amt IS A CHILD NODE OF row -->
  </row>
</root>

But then you ask, why did read_xml ignore the val and curr attributes? Because each are not an immediate descendant of <row>. They are descendants of <Amt> (i.e., grandchildren of <row>). If attributes were moved to <row>, then they will be captured as shown below:

<root>
  <row val="4000" curr="GBP">     <!-- val AND curr ARE CHILD ATTRIBS OF row -->
    <Type>PYI</Type>              <!-- Type IS A CHILD NODE OF row -->
    <Activation>N</Activation>    <!-- Activation IS A CHILD NODE OF row -->
    <Amt/>                        <!-- Amt IS A CHILD NODE OF row -->
  </row>
</root>

To capture those attributes, adjust xpath argument to point to its immediate parent:

amt_df = pd.read_xml("Input.xml", xpath="//Amt")

To have such attributes captured with <row> level information, consider the special-purpose language, XSLT, to transform your original XML to the following:

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <row>
      <Type>PYI</Type>
      <Activation>N</Activation>
      <Amt_val>4000</Amt_val>
      <Amt_curr>GBP</Amt_curr>
   </row>
</root>

Above is the intermediate output that is parsed by read_xml when using the stylesheet argument as shown below:

xsl = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  
    <xsl:template match="row">
    <xsl:copy>
      <xsl:copy-of select="*[name() != 'Amt']"/>
      <Amt_val><xsl:value-of select="Amt/@val"/></Amt_val>
      <Amt_curr><xsl:value-of select="Amt/@curr"/></Amt_curr>
    </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>'''

row_df = pd.read_xml("Input.xml", xpath="//row", stylesheet=xsl")