python, python-2.7, file, csv

Reading a huge .csv file


I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

def getdata(filename, criteria):
    data=[]
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data=[]
    with open(filename, "rb") as csvfile:
        datareader=csv.reader(csvfile)
        for row in datareader: 
            if row[3]=="column header":
                data.append(row)
            elif len(data)<2 and row[3]!=criterion:
                pass
            elif row[3]==criterion:
                data.append(row)
            else:
                return data

The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.

My questions are:

  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8 GB of RAM, running 64-bit Windows 7, and a 3.40 GHz processor (not certain what information you need).


Solution

  • You are reading all rows into a list, then processing that list. Don't do that.

    Process your rows as you produce them. If you need to filter the data first, use a generator function:

    import csv
    
    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            yield next(datareader)  # yield the header row
            count = 0
            for row in datareader:
                if row[3] == criterion:
                    yield row
                    count += 1
                elif count:
                # stop once we've read past the consecutive block of matching rows
                    return
    

    I also simplified your filter test; the logic is the same but more concise.

    Because you only need a single consecutive run of rows that match the criterion, you could also use:

    import csv
    from itertools import dropwhile, takewhile
    
    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            yield next(datareader)  # yield the header row
            # skip rows until the first match, then yield rows while they
            # still match, then stop reading altogether
            # Python 2: use `for row in takewhile(...): yield row` instead
            # of `yield from takewhile(...)`.
            yield from takewhile(
                lambda r: r[3] == criterion,
                dropwhile(lambda r: r[3] != criterion, datareader))
            return
    
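    If the dropwhile() / takewhile() combination is unfamiliar, here is a minimal standalone illustration on a toy list (the numbers are made up purely for demonstration and are not from the original post):

    from itertools import dropwhile, takewhile
    
    values = [1, 1, 5, 5, 5, 2, 5]
    # dropwhile() skips the leading items that don't match; takewhile() then
    # keeps items only while they still match, so the result is the first
    # consecutive run of matches.
    run = list(takewhile(lambda v: v == 5,
                         dropwhile(lambda v: v != 5, values)))
    # run == [5, 5, 5]; the trailing 5 after the 2 is never reached
    

    The same pattern applied to datareader picks out the one consecutive block of rows whose fourth column equals the criterion.
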

    You can now loop over getstuff() directly. Do the same in getdata():

    def getdata(filename, criteria):
        for criterion in criteria:
            for row in getstuff(filename, criterion):
                yield row
    

    Now loop directly over getdata() in your code:

    for row in getdata(somefilename, sequence_of_criteria):
        # process row
    

    You now only hold one row in memory at a time, instead of thousands of rows per criterion.
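
    For instance, here is a minimal sketch of consuming the generator end to end; the output filename and the idea of writing the matches straight back out to a new CSV are my own assumptions for illustration, not part of the original question:

    import csv
    
    def write_matches(somefilename, sequence_of_criteria, outfilename):
        # Stream matching rows straight into a new CSV; at no point is more
        # than one row held in memory.
        with open(outfilename, "wb") as outfile:  # Python 2: open csv files in binary mode
            writer = csv.writer(outfile)
            for row in getdata(somefilename, sequence_of_criteria):
                writer.writerow(row)
    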

    yield makes a function a generator function, which means it won't do any work until you start looping over it.
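
    To see that laziness for yourself, here is a tiny self-contained demo (a toy function of my own, not part of the code above):

    def gen():
        print("about to do work")  # nothing happens at call time
        yield 1
        yield 2
    
    g = gen()        # no output yet; the function body has not started running
    first = next(g)  # only now is "about to do work" printed, and 1 is returned
    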