I am trying to parse a XML using the python library lxml, and would like the resulting output to be in a dataframe. I am relatively new to python and parsing so please bear with me as I outline the problem. The original xml that I am trying to parse is available here
I am interested in obtaining some relevant tags founds in "invstOrSec". Below is a snapshot of one instance of "invstOrSec" where the text accompanying the tag "curCd" is USD.
<?xml version="1.0" encoding="UTF-8"?>
<invstOrSec>
<name>NIPPON LIFE INSURANCE</name>
<lei>549300Y0HHMFW3EVWY08</lei>
<curCd>USD</curCd>
<invstOrSec>
This is relatively straightforward, and my current approach involves first defining the relevant tags in a dictionary and then coarse them into a dataframe in a loop.
import pandas as pd
from lxml import etree
# Declare directory
os.chdir('C:/Users/A1610222/Desktop/Form NPORT/pkg/sec-edgar-filings/0001548717/NPORT-P/0001752724-
20-040624')
# Set root
xmlfile = "filing-details.xml"
tree = etree.parse(xmlfile)
root = tree.getroot()
# Remove namespace prefixes
for elem in root.getiterator():
elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
# Set path
invstOrSec = root.xpath('//invstOrSec')
# Define tags to extract
vars = {'invstOrSec' : {'name', 'lei', 'curCd'}
# Extract holdings data
sec_info = pd.DataFrame()
temp = pd.DataFrame()
for one in invstOrSec:
for two in one:
if two.tag in vars['invstOrSec']:
temp[two.tag] = [two.text]
sec_info = sec_info.append(temp)
Here are the top three rows of sec_info
name | lei | curCd |
---|---|---|
NIPPON LIFE INSURANCE | 549300Y0HHMFW3EVWY08 | USD |
Lloyds Banking Group PLC | 549300PPXHEU2JF0AM85 | USD |
Enbridge Inc | 98TPTUM4IVMFCZBCUR27 | USD |
However, the xml follows a slightly different structure when the currency is not USD. See the below example.
<?xml version="1.0" encoding="UTF-8"?>
<invstOrSec>
<name>ACHMEA BV</name>
<lei>7245007QUMI1FHIQV531</lei>
<currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
<invstOrSec>
This time curCd is replaced with a different tag currencyConditional and it contains attributes as opposed to the text. I am having a hard time trying to account for these cases while keeping my code as general as possible. I hope I have managed to illustrate the problem. Again, please excuse me if this is too elementary. Any help would be much appreciated.
This is one case where you shouldn't try to reinvent the wheel; use tools developed by others...
import pandas as pd
import pandas_read_xml as pdx
url = 'https://www.sec.gov/Archives/edgar/data/1548717/000175272420040624/primary_doc.xml'
df = pdx.read_xml(url,['edgarSubmission', 'formData', 'invstOrSecs','invstOrSec'])
#because of the non-US currency column, you have to apply one more contortion:
df['currencyConditional'] = df['currencyConditional'].apply(lambda x: x.get('@curCd') if not isinstance(x,float) else "NA" )
df[['name','lei','curCd','currencyConditional']]
Output (partial, obviously) - note the last row:
168 BNP PARIBAS R0MUWSFPU8MPRO8K5P83 USD NA
169 Societe Generale O2RNE8IBXP4R0TD8PU41 USD NA
170 BARCLAYS PLC 213800LBQA1Y9L22JB70 NaN GBP