python xml edgar

How to parse 10-Q reports from the EDGAR API in Python?


I'm trying to use the EDGAR API to retrieve the 10-Q for any given company, identified by its CIK value. The code below retrieves the most recent 10-Q for Tesla. The returned object has about 30 methods, such as keys, values, items, and text_content. Of these, text_content appears to be the only one that does not return an empty list []. However, the resulting text is not easy to parse, because a 10-Q varies considerably from one company to another.

Someone will undoubtedly ask why I set no_of_documents=2. If this parameter is set to 1, the wrong document (not a 10-Q) is returned; with any value greater than 1, actual 10-Qs are retrieved. I have no idea why the API behaves this way.

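from edgar import Company

def func(cik):
    # The company name is left blank; the CIK identifies the company.
    company = Company("", cik)
    tree = company.get_all_filings(filing_type="10-Q")
    # no_of_documents=1 returns the wrong document, so request 2 and keep the first.
    documents = Company.get_documents(tree, no_of_documents=2)
    return documents[0]

test = func('0001318605')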

What I'd like to do is print out the raw XML to take a peek at its underlying structure, then parse it with either xmltodict or xml.etree.ElementTree. However, I'm receiving the following errors.

Using ET

import xml.etree.ElementTree as ET
ET.parse(test)
>>>
TypeError: expected str, bytes or os.PathLike object, not HtmlElement

Using xmltodict

import xmltodict
xmltodict.parse(test)
TypeError: a bytes-like object is required, not 'HtmlElement'

Again, my goal is to search and navigate the XML content; however, without knowing what the tags are, I'm a bit stuck. How can I proceed?


Solution

  • You don't need to parse test: as the error messages show, it is already a parsed HtmlElement, so you can call its xpath method directly. For example:

    test.xpath('//entity/segment/explicitmember/text()')
    

    Outputs:

     ['tsla:OperatingLeaseVehiclesMember',
     'tsla:OperatingLeaseVehiclesMember',
     'tsla:SolarEnergySystemsMember',
     'tsla:SolarEnergySystemsMember',
     'tsla:AutomotiveSegmentMember',
     'tsla:AutomotiveSegmentMember',
    

    etc. and

    test.xpath('//context/period/instant/text()')
    

    outputs:

     ['2020-07-20',
     '2020-06-30',
     '2019-12-31',
     '2020-06-30',
     '2019-12-31',
    

    and so on.

    Good luck; parsing XBRL filings is not an easy task...
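
    To discover the tags before writing XPath queries, it may help to serialize the element back to markup and count the tag names, then group related values per context instead of pulling flat lists. The following is only a minimal sketch under stated assumptions: SAMPLE is an invented, heavily simplified stand-in for a filing (real filings are much larger and their tags may differ), and raw_markup, tag_counts and instant_contexts are hypothetical helper names, not part of the edgar package. With a real filing you would pass test instead of doc.

```python
from collections import Counter
from lxml import etree, html

# Hypothetical, heavily simplified stand-in for a filing document.
SAMPLE = """
<xbrl>
  <context id="c1">
    <entity>
      <segment><explicitmember>tsla:AutomotiveSegmentMember</explicitmember></segment>
    </entity>
    <period><instant>2020-06-30</instant></period>
  </context>
  <context id="c2">
    <entity></entity>
    <period><instant>2019-12-31</instant></period>
  </context>
</xbrl>
"""

# html.fromstring returns an HtmlElement, the same type as `test` in the question.
doc = html.fromstring(SAMPLE)


def raw_markup(element, limit=1000):
    """Serialize an element back to markup, truncated to `limit` characters."""
    return etree.tostring(element, pretty_print=True, encoding="unicode")[:limit]


def tag_counts(element):
    """Count how often each tag occurs, skipping comments and similar nodes."""
    return Counter(el.tag for el in element.iter() if isinstance(el.tag, str))


def instant_contexts(element):
    """Pair each context id with its instant date and any segment members."""
    records = []
    for ctx in element.xpath("//context[period/instant]"):
        records.append({
            "id": ctx.get("id"),
            "instant": ctx.xpath("string(period/instant)").strip(),
            "members": ctx.xpath("entity/segment/explicitmember/text()"),
        })
    return records


print(raw_markup(doc))                    # peek at the underlying structure
print(tag_counts(doc).most_common(5))     # which tags are worth querying
for record in instant_contexts(doc):
    print(record)
```

    The relative XPath expressions inside instant_contexts are evaluated from each context element, which keeps a date and its segment members together. The string produced by etree.tostring is also the kind of input that ET.fromstring or xmltodict.parse expect, although the object itself can already be queried without that round trip.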