I'm currently working on the download of the form.idx file from sec.gov for the first quarter of 2016. Since I'm only interested in the 10-Ks, I wanted to download the file as a .csv file and delete the useless rows. I tried to filter by the form type but that didn't work out.
My code so far is the following:
import requests
import os
years = [2016]
quarters = ['QTR1']
base_path = '/Users/xyz/Desktop'
current_dirs = os.listdir(path=base_path)
for yr in years:
if str(yr) not in current_dirs:
os.mkdir('/'.join([base_path, str(yr)]))
current_files = os.listdir('/'.join([base_path, str(yr)]))
for qtr in quarters:
local_filename = f'{yr}-{qtr}.csv'
local_file_path = '/'.join([base_path, str(yr), local_filename])
if local_filename in current_files:
print(f'Skipping file for {yr}, {qtr} because it is already saved.')
continue
url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
r = requests.get(url, stream=True)
with open(local_file_path, 'wb') as f:
for chunk in r.iter_content(chunk_size=128):
f.write(chunk)
r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=";", encoding="utf-8")
r2.head()
filt = (r2 ['Form Type'] == '10-K')
r2_10K = r2.loc[filt]
r2_10K.head()
r2_10K.to_csv('/Users/xyz/Desktop/modified.csv')
The Error message I get is:
Traceback (most recent call last):
File "<ipython-input-5-f84e3f81f3d1>", line 61, in <module>
filt = (r2 ['Form Type'] == '10-K')
File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
indexer = self.columns.get_loc(key)
File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'Form Type'
Maybe there's a way to just delete the rows I don't need in the file? Otherwise, I'm also thankful for any kind of help on that problem.
Many thanks in advance.
Kind regards, Elena
This is the full working code for you, The main issue was on your csv format that you're getting from online, Full code: https://rextester.com/QUGF24653
What I did:
import requests import os import pandas as pd years = [2016] quarters = ['QTR1'] base_path = '/Users/xyz/Desktop' current_dirs = os.listdir(path=base_path) for yr in years: if str(yr) not in current_dirs: os.mkdir('/'.join([base_path, str(yr)])) current_files = os.listdir('/'.join([base_path, str(yr)])) for qtr in quarters: local_filename = f'{yr}-{qtr}.csv' local_file_path = '/'.join([base_path, str(yr), local_filename]) if local_filename in current_files: print(f'Skipping file for {yr}, {qtr} because it is already saved.') continue url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx' r = requests.get(url, stream=True) with open(local_file_path, 'wb') as f: for chunk in r.iter_content(chunk_size=128): f.write(chunk) colnames=['Form Type', 'Company Name', 'CIK', 'Date Filed','File Name'] r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=r'\s{3,}', skiprows=10, encoding="utf-8", names=colnames,header=None) r2[['Date Filed','File Name']] = r2['Date Filed'].str.split(expand=True) filtered = (r2['Form Type'] == '10-K') r2_10K = r2.loc[filtered] print(r2_10K.head())
Output:
Form Type Company Name CIK Date Filed File Name
2181 10-K 1347 Capital Corp 1606163 2016-03-21 edgar/data/1606163/0001144204-16-089184.txt
2182 10-K 1347 Property Insurance Holdings, Inc. 1591890 2016-03-17 edgar/data/1591890/0001387131-16-004603.txt
2183 10-K 1ST CONSTITUTION BANCORP 1141807 2016-03-22 edgar/data/1141807/0001141807-16-000010.txt
2184 10-K 1ST SOURCE CORP 34782 2016-02-19 edgar/data/34782/0000034782-16-000102.txt
2185 10-K 1st Century Bancshares, Inc. 1420525 2016-03-04 edgar/data/1420525/0001437749-16-026765.txt