python edgar sec

How should I scrape an idx file on EDGAR?


I have an idx file: https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx

I could open the idx file with the following code a year ago, but it no longer works. Why is that, and how should I modify the code?

import requests
import urllib
from bs4 import BeautifulSoup

master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url).content
data_format = byte_data.decode('utf-8').split('------')
content = data_format[-1]
data_list = content.replace('\n','|').split('|')

for index, item in enumerate(data_list):
    if '.txt' in item:
        if data_list[index - 2] == '10-K':
            entry_list = data_list[index - 4: index + 1]
            entry_list[4] = "https://www.sec.gov/Archives/" + entry_list[4]
            master_data.append(entry_list)

print(master_data)

Solution

  • If you inspect the contents of the byte_data variable, you will find that it does not hold the actual content of the idx file. What comes back instead is a response the server uses to turn away scraping bots like yours. You can find more information in this answer: Problem HTTP error 403 in Python 3 Web Scraping

    In this case, the fix is to send a User-Agent header with the request.

    import requests
    
    master_data = []
    file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
    byte_data = requests.get(file_url, allow_redirects=True, headers={"User-Agent": "XYZ/3.0"}).content
    
    # Your further processing here
    

    On a side note, your processing prints only an empty list ([]) for this file, because the if condition is never met for any of the lines. An empty result therefore does not mean that this solution failed.
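
The "# Your further processing here" step could be sketched as below, once byte_data has been decoded to text. This is only a sketch, not the asker's exact code: parse_master_index is a hypothetical helper name, and the sample rows are made up to mimic the pipe-delimited layout (CIK|Company Name|Form Type|Date Filed|File Name) rather than taken from the real file. It reads the file line by line instead of splitting on the dashes, so each row stays together.

```python
def parse_master_index(text, form_type="10-K"):
    """Return [cik, company, form, date, url] rows matching one form type.

    Hypothetical helper; assumes the five-column, pipe-delimited idx layout.
    """
    rows = []
    for line in text.splitlines():
        fields = line.strip().split("|")
        # Data rows have exactly five fields and end in a .txt path. This
        # skips the preamble, the column-header row and the dashed separator.
        if len(fields) != 5 or not fields[4].endswith(".txt"):
            continue
        if fields[2] == form_type:
            fields[4] = "https://www.sec.gov/Archives/" + fields[4]
            rows.append(fields)
    return rows


# Made-up sample in the idx layout (not real filings).
sample = """Description: sample daily index
CIK|Company Name|Form Type|Date Filed|File Name
--------------------------------------------------------------------------------
1000001|EXAMPLE CORP|10-K|20201231|edgar/data/1000001/0000000000-20-000001.txt
1000002|SAMPLE FUND|497|20201231|edgar/data/1000002/0000000000-20-000002.txt
"""

print(parse_master_index(sample))
```

With the real download, the call would be parse_master_index(byte_data.decode('utf-8')). The comparison is an exact match, as in the question, so amended filings such as 10-K/A are not included.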