I have a dataframe with XML in the second column:
FILE_CREATION_DATE FILE_DATA
0 2017-09-06 <?xml version="1.0" encoding="utf-8"?><REPORT ...
1 2017-09-07 <?xml version="1.0" encoding="utf-8"?><REPORT ...
2 2017-10-09 <?xml version="1.0" encoding="utf-8"?><REPORT ...
3 2017-10-10 <?xml version="1.0" encoding="utf-8"?><REPORT ...
4 2017-12-06 <?xml version="1.0" encoding="utf-8"?><REPORT ...
How do I parse the XML in each row and output the result as a table, so that each tag becomes a column holding that tag's value for every row of the dataframe?
Thanks
Here is a sample of the XML:
<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>
Assume we have a dataframe similar to your example:
import pandas as pd
df = pd.DataFrame.from_dict({'FILE_CREATION_DATE': ['2017-09-06'], 'FILE_DATA': ['''<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>''']})
df
FILE_CREATION_DATE FILE_DATA
0 2017-09-06 <?xml version="1.0" encoding="utf-8" ?>\n<REPO...
First, let's get the column names from the tags in your XML. We'll take them from the first row only and assume every other row has the same tags.
import xml.etree.ElementTree as ET
root = ET.fromstring(df['FILE_DATA'][0])
# ElementTree prefixes each tag with '{namespace-uri}', so split on '}' to keep only the tag name
columns = [c.tag.split('}', 1)[-1] for c in root]
# convert each XML document into a dictionary of {tag: text} and assign the values to the new columns
df[columns] = df['FILE_DATA'].apply(lambda x: pd.Series({c.tag.split('}', 1)[-1]:c.text for c in ET.fromstring(x)}))
df.drop('FILE_DATA', axis=1, inplace=True)
df
FILE_CREATION_DATE CRSREPORTTIMESTAMP AGENCYIDENTIFIER AGENCYNAME
0 2017-09-06 2020-10-08... MILWAUKEE Milwaukee Police Department
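The approach above relies on every row having exactly the tags found in the first row. If that assumption might not hold, a variation is to turn each XML string into a dictionary and let pandas build the columns from the union of all keys. The following is only a sketch: the helper name xml_to_record, the urn:example namespace, and the second row (MADISON, with no AGENCYNAME) are made up for illustration and are not from your data.

```python
import xml.etree.ElementTree as ET
import pandas as pd


def xml_to_record(xml_text):
    """Map each direct child of the root element to {tag name: text}."""
    root = ET.fromstring(xml_text)
    # Drop the '{namespace-uri}' prefix that ElementTree puts in front of tags.
    return {child.tag.split('}', 1)[-1]: child.text for child in root}


# Hypothetical two-row frame; the second report has no AGENCYNAME tag.
df = pd.DataFrame({
    'FILE_CREATION_DATE': ['2017-09-06', '2017-09-07'],
    'FILE_DATA': [
        '<REPORT xmlns="urn:example">'
        '<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>'
        '<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>'
        '</REPORT>',
        '<REPORT xmlns="urn:example">'
        '<AGENCYIDENTIFIER>MADISON</AGENCYIDENTIFIER>'
        '</REPORT>',
    ],
})

# One dict per row; pandas aligns the keys into columns and
# fills tags that are missing from a row with NaN.
records = [xml_to_record(x) for x in df['FILE_DATA']]
parsed = pd.DataFrame(records, index=df.index)

# Keep the original columns (minus the raw XML) next to the parsed ones.
result = df.drop(columns='FILE_DATA').join(parsed)
print(result)
```

Like the original answer, this only reads the direct children of the root element. Nested elements would need their own flattening step.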