I would like to import an xml with nested structure into a pandas dataframe. I include a sample xml
<?xml version='1.0' encoding='utf-8'?>
<AuditFile xmlns="urn:OECD:StandardAuditFile-Taxation/2.00">
<MasterFiles>
<Customers>
<Customer>
<RegistrationNumber>FR16270524</RegistrationNumber>
<Name>Test guy S.A</Name>
<Address>
<StreetName>1, av des Champs Elysées</StreetName>
<City>France</City>
<PostalCode>75000</PostalCode>
<Country>FR</Country>
</Address>
<Contact>
<ContactPerson>
<FirstName>Boom</FirstName>
<LastName>Baker</LastName>
</ContactPerson>
<Telephone>+331523526</Telephone>
<Email>boom.baker@sample.com</Email>
</Contact>
<TaxRegistration>
<TaxRegistrationNumber>FR16270524</TaxRegistrationNumber>
<TaxNumber>FR16270524</TaxNumber>
</TaxRegistration>
<CustomerID>2800002252</CustomerID>
<AccountID>400100</AccountID>
<OpeningDebitBalance>32.76</OpeningDebitBalance>
<ClosingDebitBalance>0.0</ClosingDebitBalance>
</Customer>
</Customers>
</MasterFiles>
</AuditFile>
By 'flatten', I mean that I would like that every end node corresponds to a column in the dataframe. The first column names will then be RegistrationNumber, Name, StreetName, etc...
Pandas' documentation of method read_xml () mentions, in the documentation of parameter elems_only, that, "by default, all child elements and non-empty text nodes are returned."
This is not the case if the structure contains nested children. Unlike Excel, only the first level of nodes is imported, not the nested ones.
I read a similar question on Javascript, where it was necessary to "flatten" the xml before import in a dataframe. I also looked at previous question Flatten XML data as a pandas dataframe, but the solution provided using XSLT is heavy and beyond my skills.
2 questions:
You can use *
as wildcard in xpath and //
to dive deeper:
import pandas as pd
xml="""your xml here ..."""
df = pd.read_xml(xml, xpath= './/*')
print(df)