pythonxmlparsinglxmledgar

Parse XML with Python lxml


I am trying to parse a XML using the python library lxml, and would like the resulting output to be in a dataframe. I am relatively new to python and parsing so please bear with me as I outline the problem. The original xml that I am trying to parse is available here

I am interested in obtaining some relevant tags founds in "invstOrSec". Below is a snapshot of one instance of "invstOrSec" where the text accompanying the tag "curCd" is USD.

<?xml version="1.0" encoding="UTF-8"?>
    <invstOrSec>
        <name>NIPPON LIFE INSURANCE</name>
        <lei>549300Y0HHMFW3EVWY08</lei>
        <curCd>USD</curCd>
    <invstOrSec>

This is relatively straightforward, and my current approach involves first defining the relevant tags in a dictionary and then coarse them into a dataframe in a loop.

    import pandas as pd
    from lxml import etree

    # Declare directory
    os.chdir('C:/Users/A1610222/Desktop/Form NPORT/pkg/sec-edgar-filings/0001548717/NPORT-P/0001752724- 
    20-040624')

    # Set root
    xmlfile = "filing-details.xml"
    tree = etree.parse(xmlfile)
    root = tree.getroot()

    # Remove namespace prefixes
    for elem in root.getiterator():
        elem.tag = etree.QName(elem).localname
   
    # Remove unused namespace declarations
    etree.cleanup_namespaces(root)

    # Set path
    invstOrSec = root.xpath('//invstOrSec')

    # Define tags to extract
    vars = {'invstOrSec' : {'name', 'lei', 'curCd'}

    # Extract holdings data
    sec_info =  pd.DataFrame()
    temp = pd.DataFrame()

    for one in invstOrSec:
        for two in one:
            if two.tag in vars['invstOrSec']:
                temp[two.tag] = [two.text]
        sec_info = sec_info.append(temp)  

Here are the top three rows of sec_info

name lei curCd
NIPPON LIFE INSURANCE 549300Y0HHMFW3EVWY08 USD
Lloyds Banking Group PLC 549300PPXHEU2JF0AM85 USD
Enbridge Inc 98TPTUM4IVMFCZBCUR27 USD

However, the xml follows a slightly different structure when the currency is not USD. See the below example.

<?xml version="1.0" encoding="UTF-8"?>
    <invstOrSec>
        <name>ACHMEA BV</name>
        <lei>7245007QUMI1FHIQV531</lei>
        <currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
    <invstOrSec>

This time curCd is replaced with a different tag currencyConditional and it contains attributes as opposed to the text. I am having a hard time trying to account for these cases while keeping my code as general as possible. I hope I have managed to illustrate the problem. Again, please excuse me if this is too elementary. Any help would be much appreciated.


Solution

  • This is one case where you shouldn't try to reinvent the wheel; use tools developed by others...

    import pandas as pd
    import pandas_read_xml as pdx
    
    url = 'https://www.sec.gov/Archives/edgar/data/1548717/000175272420040624/primary_doc.xml'
    
    df = pdx.read_xml(url,['edgarSubmission', 'formData', 'invstOrSecs','invstOrSec'])
    
    #because of the non-US currency column, you have to apply one more contortion:
    df['currencyConditional'] = df['currencyConditional'].apply(lambda x: x.get('@curCd') if not isinstance(x,float) else "NA" )
    df[['name','lei','curCd','currencyConditional']]
    

    Output (partial, obviously) - note the last row:

    168     BNP PARIBAS     R0MUWSFPU8MPRO8K5P83    USD     NA
    169     Societe Generale    O2RNE8IBXP4R0TD8PU41    USD     NA
    170     BARCLAYS PLC    213800LBQA1Y9L22JB70    NaN     GBP