pythoncsvfilteringsec

Is there a way in python to delete several rows in an csv file?


I'm currently working on the download of the form.idx file from sec.gov for the first quarter of 2016. Since I'm only interested in the 10-Ks, I wanted to download the file as a .csv file and delete the useless rows. I tried to filter by the form type but that didn't work out.

My code so far is the following:

import requests
import os

years = [2016]

quarters = ['QTR1']

base_path = '/Users/xyz/Desktop'

current_dirs = os.listdir(path=base_path)

for yr in years:
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))
    
    current_files = os.listdir('/'.join([base_path, str(yr)]))
    
    for qtr in quarters:
        local_filename =  f'{yr}-{qtr}.csv'
        
    
        local_file_path = '/'.join([base_path, str(yr), local_filename])
        
        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue
        
        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
        
        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)

r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=";", encoding="utf-8")
r2.head()
filt = (r2 ['Form Type'] == '10-K')
r2_10K = r2.loc[filt]
r2_10K.head()
r2_10K.to_csv('/Users/xyz/Desktop/modified.csv')

The Error message I get is:
Traceback (most recent call last):

  File "<ipython-input-5-f84e3f81f3d1>", line 61, in <module>
    filt = (r2 ['Form Type'] == '10-K')

  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)

  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'Form Type'

Maybe there's a way to just delete the rows I don't need in the file? Otherwise, I'm also thankful for any kind of help on that problem.

Many thanks in advance.

Kind regards, Elena


Solution

  • This is the full working code for you, The main issue was on your csv format that you're getting from online, Full code: https://rextester.com/QUGF24653

    What I did:

    1. I did skip first 10 rows
    2. Set column names after using a 3 space separator
    3. Split last column to 2 new columns
    4. Filter Form Type with "10-K"
    import requests
    import os
    import pandas as pd
    
    years = [2016]
    quarters = ['QTR1']
    base_path = '/Users/xyz/Desktop'
    current_dirs = os.listdir(path=base_path)
    
    for yr in years:
        if str(yr) not in current_dirs:
            os.mkdir('/'.join([base_path, str(yr)]))
    
        current_files = os.listdir('/'.join([base_path, str(yr)]))
    
        for qtr in quarters:
            local_filename = f'{yr}-{qtr}.csv'
    
            local_file_path = '/'.join([base_path, str(yr), local_filename])
    
            if local_filename in current_files:
                print(f'Skipping file for {yr}, {qtr} because it is already saved.')
                continue
    
            url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
    
            r = requests.get(url, stream=True)
            with open(local_file_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=128):
                    f.write(chunk)
    
    colnames=['Form Type', 'Company Name', 'CIK', 'Date Filed','File Name']
    r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=r'\s{3,}', skiprows=10, encoding="utf-8", names=colnames,header=None)
    r2[['Date Filed','File Name']] = r2['Date Filed'].str.split(expand=True)
    filtered = (r2['Form Type'] == '10-K')
    r2_10K = r2.loc[filtered]
    print(r2_10K.head())
    

    Output:

       Form Type                            Company Name      CIK  Date Filed                                    File Name
    2181      10-K                       1347 Capital Corp  1606163  2016-03-21  edgar/data/1606163/0001144204-16-089184.txt
    2182      10-K  1347 Property Insurance Holdings, Inc.  1591890  2016-03-17  edgar/data/1591890/0001387131-16-004603.txt
    2183      10-K                1ST CONSTITUTION BANCORP  1141807  2016-03-22  edgar/data/1141807/0001141807-16-000010.txt
    2184      10-K                         1ST SOURCE CORP    34782  2016-02-19    edgar/data/34782/0000034782-16-000102.txt
    2185      10-K            1st Century Bancshares, Inc.  1420525  2016-03-04  edgar/data/1420525/0001437749-16-026765.txt