python xml pandas

Parse xml in a dataframe column?


I have a dataframe with XML in the second column:


  FILE_CREATION_DATE                                          FILE_DATA
0         2017-09-06  <?xml version="1.0" encoding="utf-8"?><REPORT ...
1         2017-09-07  <?xml version="1.0" encoding="utf-8"?><REPORT ...
2         2017-10-09  <?xml version="1.0" encoding="utf-8"?><REPORT ...
3         2017-10-10  <?xml version="1.0" encoding="utf-8"?><REPORT ...
4         2017-12-06  <?xml version="1.0" encoding="utf-8"?><REPORT ...

How do I parse the XML in each row and output the result as a table, so that each tag becomes a column with one value per row of the dataframe?

Thanks

Here is a sample of the XML

<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>

Solution

  • Assume we have a dataframe similar to your example:

    import pandas as pd
    df = pd.DataFrame.from_dict({'FILE_CREATION_DATE': ['2017-09-06'], 'FILE_DATA': ['''<?xml version="1.0" encoding="utf-8" ?>
    <REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
    <CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
    <AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
    <AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>''']})
    
    df
    
    FILE_CREATION_DATE  FILE_DATA
    0   2017-09-06      <?xml version="1.0" encoding="utf-8" ?>\n<REPO...
    

    Next, collect the column names from the XML tags. We read only the first row and assume every other row has the same tags.

    import xml.etree.ElementTree as ET
    
    root = ET.fromstring(df['FILE_DATA'][0])
    # ElementTree prefixes each tag with '{namespace}', so keep only the part after '}'
    columns = [c.tag.split('}', 1)[-1] for c in root]
    
    # convert each XML document into a Series keyed by tag name and assign it to the new columns
    df[columns] = df['FILE_DATA'].apply(lambda x: pd.Series({c.tag.split('}', 1)[-1]: c.text for c in ET.fromstring(x)}))
    df.drop('FILE_DATA', axis=1, inplace=True)
    df
    
    
    FILE_CREATION_DATE  CRSREPORTTIMESTAMP  AGENCYIDENTIFIER  AGENCYNAME
    0   2017-09-06      2020-10-08...       MILWAUKEE         Milwaukee Police Department
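
The approach above depends on every row having the same tags as the first one. If that might not hold, a variant along the following lines could be more forgiving: it parses each row into its own dictionary and lets pandas fill in missing tags with NaN. This is only a sketch. The helper names `local_name`, `xml_to_record` and `expand_xml_column` are made up for illustration, the second sample row is hypothetical, and only the direct children of the root element are read (nested elements are not flattened).

```python
import xml.etree.ElementTree as ET

import pandas as pd


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of a tag."""
    return tag.split('}', 1)[-1]


def xml_to_record(xml_text):
    """Return {tag: text} for each direct child of the root element."""
    root = ET.fromstring(xml_text)
    return {local_name(child.tag): child.text for child in root}


def expand_xml_column(df, column='FILE_DATA'):
    """Replace the XML column with one column per tag found in any row."""
    records = [xml_to_record(xml_text) for xml_text in df[column]]
    # Building the frame from a list of dicts gives the union of all keys;
    # a tag that is absent from a row becomes NaN in that row.
    parsed = pd.DataFrame(records, index=df.index)
    return df.drop(columns=[column]).join(parsed)


# Hypothetical two-row sample: the second row has no AGENCYNAME tag.
NS = 'http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201'
sample = pd.DataFrame({
    'FILE_CREATION_DATE': ['2017-09-06', '2017-09-07'],
    'FILE_DATA': [
        f'<?xml version="1.0" encoding="utf-8"?><REPORT xmlns="{NS}">'
        '<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>'
        '<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>',
        f'<?xml version="1.0" encoding="utf-8"?><REPORT xmlns="{NS}">'
        '<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER></REPORT>',
    ],
})

result = expand_xml_column(sample)
print(result)
```

Note that `join` raises an error if an XML tag has the same name as an existing column, so a real dataset may need a prefix or a rename step first.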