pandas file-read edgar

Delete txt file based on keywords from multiple folders


I downloaded a bunch of 10-Ks from EDGAR. I need to keep only the 10-K reports that contain the keywords "cryptocurrency" and "blockchain". Each company has its own folder. However, I am stuck on reading the txt files from multiple folders. Below is my code:

Step 1 (this part works well and generates the correct directories):

import os
import pandas as pd

path = 'C:/test/2014/QTR1/'
words = ['cryptocurrency', 'blockchain']

filelist = os.listdir(path)

Path2 = []
for x in filelist:
    Path2.append(path + x+ '/')
print(Path2)

Step 2:

for i in Path2:
    filelist2 = os.listdir(i)
    for j in filelist2:
        if j.endswith('.txt'):
                
                each_file_content = open(j, 'r', encoding="utf-8").read()
                if not any(word in each_file_content for word in words):
                    os.unlink(j)

After running it, Jupyter gave me the error below:

FileNotFoundError                         Traceback (most recent call last)
Input In [43], in <cell line: 1>()
      3     for j in filelist2:
      4         if j.endswith('.txt'):
----> 6                 each_file_content = open(j, 'r', encoding="utf-8").read()
      7                 if not any(word in each_file_content for word in words):
      8                     os.unlink(j)

FileNotFoundError: [Errno 2] No such file or directory: '0001000180-14-000019.txt'

Could anyone please help me revise the code above, or suggest another way to accomplish this task? Thank you in advance!

I want to delete the files that do not contain the two keywords. Any suggestions would be helpful!


Solution

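  • Using pathlib. rglob searches the folder tree recursively and yields full paths, which avoids the FileNotFoundError that comes from passing the bare file names returned by os.listdir to open:

    from pathlib import Path
    
    
    words = ["cryptocurrency", "blockchain"]
    # rglob yields the full path of every .txt file below QTR1
    files = Path("C:/test/2014/QTR1/").rglob("*.txt")
    for file in files:
        # read_text opens and closes the file, so it can be deleted afterwards
        each_file_content = file.read_text(encoding="utf-8")
        # delete the file if at least one of the keywords is missing
        if any(word not in each_file_content for word in words):
            file.unlink()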
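
  • Keeping the os.listdir approach from the question: the error occurs because os.listdir returns only file names, so open(j) looks for the file in the current working directory instead of the company folder. Joining the folder and the file name with os.path.join fixes that. Below is a minimal sketch of this idea, not a definitive implementation. The function name delete_files_without_keywords and the require_all parameter are hypothetical, and errors="ignore" is an assumption in case some filings are not valid UTF-8. With require_all=True a file is kept only if it contains both keywords; with require_all=False it is kept if it contains at least one, which is what the "not any(...)" check in the question does.

```python
import os


def delete_files_without_keywords(root, words, require_all=True):
    """Delete .txt files in the subfolders of root that fail the keyword check.

    require_all=True keeps only files containing every keyword;
    require_all=False keeps files containing at least one keyword.
    Returns the list of deleted paths.
    """
    deleted = []
    for folder in os.listdir(root):
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue
        for name in os.listdir(folder_path):
            if not name.endswith(".txt"):
                continue
            # os.listdir returns bare names, so build the full path first
            full_path = os.path.join(folder_path, name)
            # the with block closes the file before it may be deleted
            with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()
            hits = [word in content for word in words]
            keep = all(hits) if require_all else any(hits)
            if not keep:
                os.remove(full_path)
                deleted.append(full_path)
    return deleted
```

    For the folder layout in the question, this could be called as delete_files_without_keywords("C:/test/2014/QTR1/", ["cryptocurrency", "blockchain"]). The match is case-sensitive, so "Blockchain" would not count unless both the content and the keywords are lowercased first. It may be worth trying it on a copy of the data before deleting anything for real.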