I'm working on a script to parse txt files and store them into a pandas dataframe that I can export to a CSV.
My script worked fine when I was using <100 of my files, but now that I'm trying to run it on the full sample, I'm running into a lot of issues.
I'm dealing with ~8000 .txt files with an average size of 300 KB, so about 2.5 GB in total.
I was wondering if I could get tips on how to make my code more efficient.
For opening and reading the files, I use:

import os

filenames = os.listdir('.')
file_contents = {}  # renamed from "dict", which shadows the built-in type
for filename in filenames:
    if not filename.endswith('.txt'):
        continue  # skip anything in the directory that isn't one of the text files
    with open(filename) as f:
        contents = f.read()
    file_contents[filename.replace(".txt", "")] = contents
Doing print(file_contents) crashes my Python (or at least it seems to).
Is there a better way to handle this?
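To be clear, I don't actually need the whole thing printed; I just want to spot-check that the files were read correctly. Something like this sketch is all I really mean:

# check the dict's size and a small sample instead of printing all ~8000 entries
print(len(file_contents))                  # how many files were read
for key in list(file_contents)[:3]:        # first three keys only
    print(key, file_contents[key][:200])   # first 200 characters of each file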
I also convert all the values in my dict to lowercase, using:
def lower_dict(d):
    # build a new dict with the same keys and lowercased values
    return {k: v.lower() for k, v in d.items()}

lower = lower_dict(file_contents)
I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering if this would cause problems?
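One alternative I considered (purely a guess on my part) is lowercasing at read time, so I never hold a second full copy of all the text in memory:

with open(filename) as f:
    # lowercase here so only one copy of each file's text is ever kept
    file_contents[filename.replace(".txt", "")] = f.read().lower()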
Now, before I am marked as a duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?
However, that user seemed to be working with one very large 5 GB file, whereas I am working with many small files totalling 2.5 GB (and my ENTIRE sample is actually something like 50 GB across 60,000 files). So I was wondering whether my approach needs to be different. Sorry if this is a dumb question; unfortunately, I am not well versed in RAM and memory management.
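In case it clarifies the question, the direction I was considering for the full 50 GB sample is reading the files in batches and appending each batch to the CSV as I go, so everything never has to sit in memory at once. This is just a sketch (the batch size of 1000 and the output filename are guesses on my part):

import os
import pandas as pd

BATCH_SIZE = 1000  # tune to available RAM
filenames = [f for f in os.listdir('.') if f.endswith('.txt')]

for start in range(0, len(filenames), BATCH_SIZE):
    rows = []
    for filename in filenames[start:start + BATCH_SIZE]:
        with open(filename) as f:
            rows.append({"name": filename.replace(".txt", ""),
                         "contents": f.read().lower()})
    df = pd.DataFrame(rows)
    # append each batch to the CSV; only write the header for the first batch
    df.to_csv("output.csv", mode="a", header=(start == 0), index=False)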
Any help is very much appreciated.
Thanks!
I believe the thing slowing your code down the most is the .replace() method you're using; as I understand it, the built-in replace method works iteratively and as a result is very inefficient. Try using the re module in your for loops. Here is an example of how I used the module recently to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:
import re

for line in lines:
    # strip every occurrence of "T", ":" and "-" in a single pass
    line = re.sub('[T:-]', '', line)
Let me know if this helps!
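If it helps, applied to a dict like yours (and compiling the pattern once outside the loop, so it isn't rebuilt for each of the ~8000 files), it might look something like this; the variable name file_contents is just my assumption from your post:

import re

pattern = re.compile('[T:-]')  # compile once, reuse for every file

cleaned = {k: pattern.sub('', v) for k, v in file_contents.items()}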