python, python-3.x, query-optimization, large-data, large-data-volumes

Tips for working with a large number of .txt files (and large total size) in Python?


I'm working on a script to parse .txt files and store their contents in a pandas DataFrame that I can export to a CSV.

My script worked fine when I was using fewer than 100 files, but now that I'm trying to run it on the full sample, I'm running into a lot of issues.

I'm dealing with ~8000 .txt files with an average size of 300 KB, so about 2.5 GB in total.

I was wondering if I could get tips on how to make my code more efficient.

For opening and reading the files, I use:

import os

filenames = os.listdir('.')
file_contents = {}  # avoid naming this `dict`, which shadows the built-in
for file in filenames:
    if not file.endswith(".txt"):  # skip anything that isn't a .txt file
        continue
    with open(file) as f:
        contents = f.read()
        file_contents[file.replace(".txt", "")] = contents

Calling print(file_contents) crashes (or at least appears to hang) my Python session. Is there a better way to handle this?
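For example, would an incremental approach like this make more sense? Each row is written to the CSV as soon as its file is read, so only one file's contents is in memory at a time (the two-column layout and the output filename here are just placeholders):

import csv
import os

with open('output.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['filename', 'contents'])  # placeholder column names
    for name in os.listdir('.'):
        if not name.endswith('.txt'):
            continue
        with open(name, encoding='utf-8') as f:
            # one row per file; nothing accumulates in memory
            writer.writerow([name.replace('.txt', ''), f.read()])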

Additionally, I convert all the values in my dict to lowercase, using:

def lower_dict(d):
    # build a new dict with the same keys and lowercased values
    return {k: v.lower() for k, v in d.items()}

lower = lower_dict(file_contents)

I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering if this would cause problems?
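On a related note, I'm guessing that lowercasing while reading, rather than afterwards, would avoid holding two full copies of the data at once, e.g. inside the reading loop above:

# lowercase each file's text as it is read, instead of
# building a second full-size dict afterwards
contents = f.read().lower()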

Now, before I am marked as a duplicate: I did read this question: How can I read large text files in Python, line by line, without loading it into memory?

However, that user was working with one very large 5 GB file, whereas I am working with many small files totalling 2.5 GB (and my ENTIRE sample is actually something like 60,000 files totalling 50 GB), so I was wondering whether my approach needs to be different. Sorry if this is a dumb question; unfortunately, I am not well versed in RAM and how computers process data.

Any help is very much appreciated.

Thanks!


Solution

  • I believe the thing slowing your code down the most is the .replace() method you are using. I believe this is because the built-in replace method works iteratively, and as a result is very inefficient. Try using the re module in your for loops instead. Here is an example of how I used the module recently to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:

    import re

    # strip every 'T', ':' and '-' character from each line
    lines = [re.sub('[T:-]', '', line) for line in lines]
    

    Let me know if this helps!
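
  • One more suggestion along the same lines: when the same pattern is applied across thousands of files, compiling it once with re.compile avoids re-parsing the pattern on every call (the lines variable below is assumed to hold one file's lines, as above):

    import re

    pattern = re.compile('[T:-]')  # compile the pattern once, reuse for every file
    lines = [pattern.sub('', line) for line in lines]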