I am new to coding, so please bear with me if I misuse terminology. I am doing a research project in which I am trying to scrape public company 10-Ks from sec.gov via EDGAR. I have read various sources and watched various videos, but I found the reference below to be the most relevant to my project and, quite frankly, the easiest for me to follow. The explanation of my code begins on page 194 of that reference, and the code itself is on page 195. I am first attempting to download the index files (image below), which will help me write code to get the 10-Ks specifically. So, I am in the early stages of my project.
This is just the reference for the paper I am using. It is currently on SSRN, so I realize not everyone may have access. I would upload the PDF, but I don't see that as an option. I listed this purely to show I have a source for what I am doing. I can provide screenshots if necessary.
Anand, V., Bochkay, K., Chychyla, R., & Leone, A. J. (2020). Using Python for text analysis in accounting research. Forthcoming, Foundations and Trends in Accounting.
Currently, I have two issues: my code doesn’t work as intended, and I appear to be getting blocked by sec.gov. I will discuss the former first and the latter at the end. When I run the code below, it should download the index files for both 2018 and 2019 to the down_direct path. However, this code only grabs 2018 index files.
The log/IDLE shell results below show a “successful” run and an unsuccessful run. The unsuccessful run makes me think I have been blocked by sec.gov. It is my understanding that certain websites look for requests from urllib.request and may automatically screen for them. However, sec.gov is researcher friendly as long as you attempt downloads after hours and space out your attempts, both of which I have done (I worked on this from 7pm to 10pm last night and waited about 10 minutes between attempts). So, my questions are:
1. How should I adjust my code to make it run as intended (i.e., pull all 4 quarters for every year from start_year through end_year)?
2. Am I being blocked by sec.gov? If so, can I tweak my code to get around that?
import os
import urllib.request
from pathlib import Path

def get_index(start_year:int, end_year:int, down_direct:str):
    start_year = 2018
    end_year = 2019
    down_direct = r"C:/Users/Documents/Student Files/~Current Student/~RESEARCH/~First Summer Paper/Data/EDGAR/"
    print('Retrieving data')
    if not os.path.exists(down_direct):
        os.makedirs(down_direct)
    for year in range(start_year, end_year+1):
        for qtr in range(1,5):
            url = r"https://www.sec.gov/Archives/edgar/full-index/" + str(year) + '/' + 'QTR' + str(qtr) + '/master.idx'
            dl_file = down_direct + 'master' + str(year) + str(qtr) + '.idx'
        urllib.request.urlretrieve(url, dl_file)
        print('Downloaded', dl_file, end = '\n')
        print('Data retrieved')
        return

down_direct = os.path.join(Path.home(), 'edgar', 'indexfiles')
get_index(2018, 2019, down_direct)
Successful Run
Retrieving Data
Downloaded C:/Users/Documents/Student Files/~Current Student/~RESEARCH/~First Summer Paper/Data/EDGAR/master20184.idx
Data retrieved
Unsuccessful Run (For sake of space, I only included the error line)
Retrieving Data
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have seen similar posts where people recommend adding the line below to get around this error, but I am so green that I don’t really know how to incorporate it. Any help is appreciated, and if I need to edit my post with more information, please let me know.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
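One way to fold that Request line into the original urllib-based function is sketched below. It is only a sketch under a few assumptions: the helper name index_targets, the pause parameter, and the USER_AGENT value are all made up for illustration. As far as I know, sec.gov asks automated clients to identify themselves in the User-Agent header, so a name and contact email is probably a safer choice than a generic browser string, but treat that as something to confirm against the SEC's own guidance. The download happens once per year and quarter, which is what the original function was meant to do.

```python
import time
import urllib.request
from pathlib import Path

BASE_URL = "https://www.sec.gov/Archives/edgar/full-index"

# Placeholder identity (an assumption): replace with your own name and email.
USER_AGENT = "Sample Researcher sample.researcher@example.edu"


def index_targets(start_year, end_year, down_direct):
    """Yield one (url, destination path) pair per year and quarter."""
    for year in range(start_year, end_year + 1):
        for qtr in range(1, 5):
            url = f"{BASE_URL}/{year}/QTR{qtr}/master.idx"
            yield url, Path(down_direct) / f"master{year}QTR{qtr}.idx"


def get_index(start_year, end_year, down_direct, pause=1.0):
    """Download every quarterly master.idx from start_year through end_year."""
    Path(down_direct).mkdir(parents=True, exist_ok=True)
    for url, dl_file in index_targets(start_year, end_year, down_direct):
        # Attach the User-Agent header to this request, then save the body.
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as response:
            dl_file.write_bytes(response.read())
        print("Downloaded", dl_file)
        time.sleep(pause)  # space out the requests
    print("Data retrieved")
```

Calling get_index(2018, 2019, down_direct) with your own folder should then attempt eight downloads, one per quarter. Note that the function no longer overwrites its own arguments, so the values passed in are the ones actually used.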
import requests

# Browser-like headers sent with every request to sec.gov
heads = {'Host': 'www.sec.gov', 'Connection': 'close',
         'Accept': 'application/json, text/javascript, */*; q=0.01', 'X-Requested-With': 'XMLHttpRequest',
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
         }

def download(year):
    # Fetch master.idx for each of the four quarters of the given year
    for qtr in range(1, 5):
        url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{qtr}/master.idx"
        response = requests.get(url, headers=heads)
        print(url)
        response.raise_for_status()  # stop on a 403 or any other HTTP error
        down_direct = r"C:/Users/Documents/Student Files/~Current Student/~RESEARCH/~First Summer Paper/Data/EDGAR/"
        # Save the raw bytes, one file per year and quarter
        with open(f'{down_direct}/master{year}QTR{qtr}.idx', 'wb') as f:
            f.write(response.content)

start_year = 2018
end_year = 2019
for i in range(start_year, end_year + 1):
    download(i)
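Once the index files are on disk, the next step mentioned in the question (picking out the 10-Ks) could look roughly like the sketch below. It assumes the usual master.idx layout: a few header lines followed by pipe-delimited rows of CIK|Company Name|Form Type|Date Filed|Filename. The function name ten_k_rows and the sample rows are made up for illustration and are not real filings.

```python
def ten_k_rows(lines, form_type="10-K"):
    """Return (cik, company, date_filed, filename) for rows of one form type."""
    rows = []
    for line in lines:
        parts = line.strip().split("|")
        # Data rows have exactly five fields and a numeric CIK; header lines do not.
        if len(parts) == 5 and parts[0].isdigit() and parts[2] == form_type:
            cik, company, _, date_filed, filename = parts
            rows.append((cik, company, date_filed, filename))
    return rows


# Made-up sample in the assumed master.idx layout (not real filings).
sample = """\
Description:           Master Index of EDGAR Dissemination Feed
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000001|EXAMPLE CORP|10-K|2018-02-15|edgar/data/1000001/0000000000-18-000001.txt
1000002|SAMPLE INC|8-K|2018-02-16|edgar/data/1000002/0000000000-18-000002.txt
"""

matches = ten_k_rows(sample.splitlines())
print(matches)
```

The match on form type is exact, so variants such as 10-K/A would need their own call or a looser comparison. For a real file you would pass in the lines read from disk; an explicit encoding such as latin-1 may be needed when opening it, since I am not certain the index files are always valid UTF-8.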