I downloaded a bunch of 10-Ks from EDGAR. I need to keep only the 10-K reports that contain the keywords "cryptocurrency" and "blockchain". Each company has its own folder. However, I am stuck on reading the txt files from multiple folders. Below is my code:
Step 1 (this part works well and generates the correct directories):
import os
import pandas as pd
path = 'C:/test/2014/QTR1/'
words = ['cryptocurrency', 'blockchain']
filelist = os.listdir(path)
Path2 = []
for x in filelist:
    Path2.append(path + x + '/')
print(Path2)
Step 2:
for i in Path2:
    filelist2 = os.listdir(i)
    for j in filelist2:
        if j.endswith('.txt'):
            each_file_content = open(j, 'r', encoding="utf-8").read()
            if not any(word in each_file_content for word in words):
                os.unlink(j)
After running it, Jupyter showed me the error below:
FileNotFoundError                         Traceback (most recent call last)
Input In [43], in <cell line: 1>()
      3 for j in filelist2:
      4     if j.endswith('.txt'):
----> 6         each_file_content = open(j, 'r', encoding="utf-8").read()
      7         if not any(word in each_file_content for word in words):
      8             os.unlink(j)

FileNotFoundError: [Errno 2] No such file or directory: '0001000180-14-000019.txt'
Could anyone please help me revise the code above, or suggest another way to accomplish the task I described? Thank you in advance!
I want to delete the files that do not contain the two keywords; any suggestions would be helpful!
Using pathlib (rglob recursive search):
from pathlib import Path

words = ["cryptocurrency", "blockchain"]
files = Path("C:/test/2014/QTR1/").rglob("*.txt")

for file in files:
    # read_text opens and closes the file for us
    each_file_content = file.read_text(encoding="utf-8")
    # delete the file if either keyword is missing
    if any(word not in each_file_content for word in words):
        file.unlink()
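For what it's worth, the FileNotFoundError in your Step 2 happens because os.listdir returns bare filenames, so open(j) looks for the file in the notebook's working directory rather than inside the subfolder. If you prefer to keep your original os.listdir approach, a minimal sketch of the fix could look like this (reusing your path and words variables, and keeping your original condition, which deletes a file only when neither keyword appears):

import os

path = 'C:/test/2014/QTR1/'
words = ['cryptocurrency', 'blockchain']

for folder in os.listdir(path):
    folder_path = os.path.join(path, folder)
    for name in os.listdir(folder_path):
        if name.endswith('.txt'):
            # build the full path, not just the bare filename
            full_path = os.path.join(folder_path, name)
            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read()
            # delete the file if neither keyword appears (same condition as your Step 2)
            if not any(word in content for word in words):
                os.unlink(full_path)

Either way, the key change is passing the full path (folder plus filename) to open() and os.unlink() instead of the filename alone.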