Tags: python, list, csv, dictionary, slurp

slurp/csv/loop a file to create a list of dictionaries


I have a large file (1.6 GB) with millions of rows whose columns are delimited with:

[||]

I have tried to use the csv module, but it says I can only use a single character as a delimiter. So here is what I have:

fileHandle = open('test.txt', 'r', encoding="UTF-16")

thelist = []

for line in fileHandle:
    fields = line.split('[||]')

    therow = {
        'dea_reg_nbr':fields[0],
        'bus_actvty_cd':fields[1],
        'drug_schd':fields[3],
        #50 more columns like this
    }
    thelist.append(therow)

fileHandle.close()

#now I have thelist which is what I want

And boom, now I have a list of dictionaries, and it works. I want a list because I care about the order, and dictionaries because that's what's expected downstream. This just feels like something I should be doing more efficiently; I don't think it scales well to over a million rows and this much data. So, my question is as follows:

What would be the more efficient way of taking a multi-character delimited text file (UTF-16 encoded) and creating a list of dictionaries?

Any thoughts would be appreciated!


Solution

  • One way to make this scale better is to use a generator instead of loading all million rows into memory at once. Whether that's possible depends on your use-case; it works best if you only need to make one pass over the full data set. Multiple passes will require you either to keep all the data in memory in some form or to read the file multiple times.

    Anyway, here's an example of how you could use a generator for this problem:

    def file_records():
        with open('test.txt', 'r', encoding='UTF-16') as fileHandle:
            for line in fileHandle:
                fields = line.rstrip('\n').split('[||]')  # strip the newline so the last column is clean
                therow = {
                    'dea_reg_nbr':fields[0],
                    'bus_actvty_cd':fields[1],
                    'drug_schd':fields[3],
                    #50 more columns like this
                }
                yield therow
    
    for record in file_records():
        pass  # do work on one record

    The function file_records is a generator function because of the yield keyword. When this function is called, it returns an iterator that you can iterate over exactly like a list. The records will be returned in order, and each one will be a dictionary.
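
    For instance, here is a minimal, self-contained sketch (a toy count_up_to generator, not part of the original code) showing that lazy, one-value-at-a-time behavior:

    def count_up_to(n):
        # 'yield' makes this a generator function: calling it returns
        # an iterator instead of running the body immediately.
        for i in range(1, n + 1):
            yield i

    gen = count_up_to(3)
    print(next(gen))  # 1 -- the body runs only until the first yield
    print(next(gen))  # 2 -- execution resumes right where it paused
    print(list(gen))  # [3] -- the rest can be consumed like any iterable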

    If you're unfamiliar with generators, the official Python documentation is a good place to start reading about them.

    The thing that makes this scale so well is that only one therow is ever in memory at a time. Essentially, at the beginning of each iteration of the loop, file_records reads the next line of the file and returns the computed record. It waits until the next row is needed before doing the work, and the previous record won't linger in memory unless something still references it (such as whatever data structure you build in place of # do work on one record).
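
    As a concrete sketch of that single-pass pattern (the drug_schd tally below is a hypothetical use-case, not something from the question), you can aggregate while streaming so that no record outlives its loop iteration:

    from collections import Counter

    schedule_counts = Counter()
    for record in file_records():
        # Each record is inspected and then discarded; only the small
        # Counter of per-schedule tallies stays in memory, never the
        # millions of rows themselves.
        schedule_counts[record['drug_schd']] += 1

    print(schedule_counts.most_common(5))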

    Note also that I moved the open call to a with statement. This will ensure that the file gets closed and all related resources are freed once the iteration is done or an exception is raised. This is much simpler than trying to catch all those cases yourself and calling fileHandle.close().
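
    If it helps to see what the with statement buys you, it behaves roughly like this try/finally sketch:

    fileHandle = open('test.txt', 'r', encoding='UTF-16')
    try:
        for line in fileHandle:
            pass  # process the line
    finally:
        # Runs whether the loop finishes normally or raises an
        # exception -- the same guarantee 'with' gives you.
        fileHandle.close()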