pythonpandasxmlreadxml

Parsing nested children nodes using pandas.read_xml


I would like to import an xml with nested structure into a pandas dataframe. I include a sample xml

<?xml version='1.0' encoding='utf-8'?>
<AuditFile xmlns="urn:OECD:StandardAuditFile-Taxation/2.00">
    <MasterFiles>
        <Customers>
            <Customer>
                <RegistrationNumber>FR16270524</RegistrationNumber>
                <Name>Test guy S.A</Name>
                <Address>
                    <StreetName>1, av des Champs Elysées</StreetName>
                    <City>France</City>
                    <PostalCode>75000</PostalCode>
                    <Country>FR</Country>
                </Address>
                <Contact>
                    <ContactPerson>
                        <FirstName>Boom</FirstName>
                        <LastName>Baker</LastName>
                    </ContactPerson>
                    <Telephone>+331523526</Telephone>
                    <Email>boom.baker@sample.com</Email>
                </Contact>
                <TaxRegistration>
                    <TaxRegistrationNumber>FR16270524</TaxRegistrationNumber>
                    <TaxNumber>FR16270524</TaxNumber>
                </TaxRegistration>
                <CustomerID>2800002252</CustomerID>
                <AccountID>400100</AccountID>
                <OpeningDebitBalance>32.76</OpeningDebitBalance>
                <ClosingDebitBalance>0.0</ClosingDebitBalance>
            </Customer>
        </Customers>        
    </MasterFiles>      
</AuditFile>

By 'flatten', I mean that I would like that every end node corresponds to a column in the dataframe. The first column names will then be RegistrationNumber, Name, StreetName, etc...

Pandas' documentation of method read_xml () mentions, in the documentation of parameter elems_only, that, "by default, all child elements and non-empty text nodes are returned."

This is not the case if the structure contains nested children. Unlike Excel, only the first level of nodes is imported, not the nested ones.

I read a similar question on Javascript, where it was necessary to "flatten" the xml before import in a dataframe. I also looked at previous question Flatten XML data as a pandas dataframe, but the solution provided using XSLT is heavy and beyond my skills.

2 questions:

  1. Is there a pandas' functionality that addresses those nested xmls that I missed (like Excel does)?
  2. if this is not the case, is there an easier way to flatten or should I define the flat format manually (e.g by parsing into a dict and redefining the key as suggested in the stackoverflow article mentioned above?

Solution

  • You can use * as wildcard in xpath and // to dive deeper:

    import pandas as pd
    
    xml="""your xml here ..."""
    
    df = pd.read_xml(xml, xpath= './/*')
    print(df)