Tags: python, html, beautifulsoup, text-extraction, edgar

Extracting a text section from HTML (EDGAR 10-K filings)


I am trying to extract a certain section from HTML files. Specifically, I am looking for the "ITEM 1" section of 10-K filings (annual business reports that US companies file with the SEC). E.g.: https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002

Problem: I am not able to locate the "ITEM 1" section, and I do not know how to tell my algorithm to search from one point ("ITEM 1") to another (e.g. "ITEM 1A") and extract the text in between.

I am super thankful for any help.

Among other things, I have tried the following (and similar variants), but bd is always empty:

    try:
        # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
        # bd = soup.find_all(name="ITEM 1")
        # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])

        print(" Business Section (Item 1): ", bd.content)

    except:
        print("\n Section not found!")

I am using Python 3.7 and BeautifulSoup 4.

Regards Heka


Solution

  • As I mentioned in a comment, the HTML layout of EDGAR filings varies from filer to filer, so this may work on one filing but fail on another. The principles, though, should generally carry over (after some adjustments...)

    import requests
    import lxml.html

    # The "#a_002" fragment is only used by browsers and is never sent to the server.
    url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm'
    # Note: SEC asks automated clients to identify themselves in the User-Agent
    # header; the value below is a placeholder, replace it with your own details.
    headers = {'User-Agent': 'Your Name your.email@example.com'}
    source = requests.get(url, headers=headers)
    source.raise_for_status()
    doc = lxml.html.fromstring(source.text)

    # In this filing, Item 1 sits in a series of <p> tags following a table
    # that contains an <a> tag whose "name" attribute is "a_002".
    tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')

    for i in tabs:
        # Print the text of each <p> (whitespace collapsed), then check what follows it.
        if i.text is not None:
            print(' '.join(i.text_content().split()))
        nxt = i.getparent().getnext()
        # Item 1 ends when the next sibling is a table containing <a name="a_003">,
        # which marks the beginning of the next Item; so we stop there.
        if nxt is not None and nxt.tag == 'table':
            if any(a.get('name') == 'a_003' for a in nxt.iter('a')):
                break
    

    The output is the text of Item 1 in this filing.
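
Since the anchor names used above ("a_002", "a_003") are specific to this filing, a text-based fallback may carry over to other filings more easily: flatten the HTML to lines of visible text, then take everything between a line starting with "Item 1" and the next line starting with "Item 1A". The following is only a minimal sketch using the standard library. The names `html_to_lines` and `extract_section`, the heading patterns, and the sample HTML are all assumptions made for illustration, not anything EDGAR guarantees. Because the table of contents usually repeats the same headings, the sketch keeps the longest candidate span, which is a heuristic.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, breaking lines around block-level tags."""

    # Assumed set of tags that should start/end a line of text.
    BLOCK_TAGS = {"p", "div", "table", "tr", "td", "br", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp; etc. into characters
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_lines(html):
    """Return the non-empty text lines of an HTML string, whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")  # non-breaking spaces
    lines = (re.sub(r"\s+", " ", ln).strip() for ln in text.split("\n"))
    return [ln for ln in lines if ln]


def extract_section(html,
                    start_pat=r"item\s+1\.?(?!\w)",
                    end_pat=r"item\s+1a\.?(?!\w)"):
    """Return the text between a start heading and the next end heading.

    Heuristic: if several start headings exist (e.g. one in the table of
    contents), keep the candidate with the most text.
    """
    lines = html_to_lines(html)
    start_re = re.compile(start_pat, re.IGNORECASE)
    end_re = re.compile(end_pat, re.IGNORECASE)
    best = []
    for s, line in enumerate(lines):
        if not start_re.match(line):  # match() anchors at the line start
            continue
        # Index of the first end heading after this start heading, if any.
        e = next((k for k in range(s + 1, len(lines)) if end_re.match(lines[k])), None)
        if e is None:
            continue
        body = lines[s + 1:e]
        if sum(map(len, body)) > sum(map(len, best)):
            best = body
    return "\n".join(best)


# Made-up sample that mimics the structure described above (not a real filing).
sample = """
<html><body>
<table>
  <tr><td><a href="#a_002">Item 1.</a></td><td>Business</td></tr>
  <tr><td><a href="#a_003">Item 1A.</a></td><td>Risk Factors</td></tr>
</table>
<table><tr><td><font><a name="a_002"></a>ITEM&nbsp;1. BUSINESS</font></td></tr></table>
<p><font>We are a <b>hypothetical</b> company.</font></p>
<p><font>We sell widgets.</font></p>
<table><tr><td><font><a name="a_003"></a>ITEM 1A. RISK FACTORS</font></td></tr></table>
<p><font>Widgets may go out of fashion.</font></p>
</body></html>
"""

print(extract_section(sample))
```

For a real filing, one would pass the downloaded page text (e.g. `source.text` from the snippet above) instead of `sample`. Headings that are split across several tags or worded differently would need adjusted patterns, so the result should be checked by eye on each new filer.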