python csv distorts tell


I am trying to report how far through a csv file I am, as a percentage, while reading it. I know how to do this with tell() on a plain file object, but when I pass that file object to csv.reader and loop over the rows of the reader, tell() always returns the end-of-file position, no matter where I am in the loop. How can I find my current position?

Current code:

with open(FILE_PERSON, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting

I threw "justtesting" in there just to prove that tell() does return 0 until I start my for loop.

This prints the same thing for every row in my csv file: 579 of 579 | 0

What am I doing wrong?


Solution

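  • csv.reader pulls lines from your file through the file object's iterator, and in Python 2 that iterator uses a hidden read-ahead buffer. The file pointer therefore jumps in larger blocks instead of moving line by line, and tell() reports the end of the last block read. Your 579-byte file fits in a single block, which is why tell() returns 579 from the first row onward.

    A CSV row also does not have to match one physical line, because newlines can be embedded in quoted values, so tracking position line-by-line would not be reliable either.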

    If you have to give a progress report, then you need to pre-count the number of lines. The following will only work if your input CSV file does not embed newlines in column values:

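    import csv

    with open(FILE_PERSON, 'rb') as csvfile:
        # First pass: count physical lines (assumes no newlines inside values).
        linecount = sum(1 for _ in csvfile)
        csvfile.seek(0)  # rewind before handing the file to csv.reader
        spamreader = csv.reader(csvfile)
        # Start at 1 so the last row reports "linecount of linecount".
        for line, row in enumerate(spamreader, 1):
            print '{} of {}'.format(line, linecount)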
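
If the file may contain newlines inside quoted values, one possible variation is to let csv.reader do the counting pass as well, so the total is a count of parsed rows instead of physical lines. This is only a sketch, written for Python 3 (unlike the Python 2 snippet above); rows_with_count and the sample data are hypothetical names made up for illustration, and the file is assumed to be seekable and opened in text mode.

```python
import csv
import io


def rows_with_count(csvfile):
    """Yield (row_number, total_rows, row) for a seekable text-mode file."""
    # First pass: count parsed rows, so quoted newlines do not inflate the total.
    total = sum(1 for _ in csv.reader(csvfile))
    csvfile.seek(0)  # rewind for the second, real pass
    for number, row in enumerate(csv.reader(csvfile), 1):
        yield number, total, row


# Hypothetical sample: four physical lines, but only three CSV rows.
sample = io.StringIO('name,note\nann,"two\nlines"\nbob,plain\n')
for number, total, row in rows_with_count(sample):
    print("{} of {}".format(number, total), row)
```

The cost is that the file is parsed twice, which may matter for very large inputs.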
    

    There are other methods to count the number of lines (see "How to get line count cheaply in Python?"), but since you'll be reading the file anyway to process it as a CSV, you may as well make use of the open file you have for that. I'm not certain that opening the file as a memory map and then reading it as a normal file again would perform any better.
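
If what you want is the byte-based percentage from the question and not a row count, another option is to count the bytes yourself as they are handed to csv.reader, instead of asking the file for its position. The following is a rough sketch for Python 3, not something the csv module provides: rows_with_progress and its parameters are hypothetical names, and it assumes the file is opened in binary mode with a known encoding.

```python
import csv
import io


def rows_with_progress(binary_file, total_size, encoding="utf-8"):
    """Yield (row, fraction_consumed) pairs from a file opened in binary mode."""
    consumed = 0

    def counted_lines():
        # Hand decoded lines to csv.reader while tallying the raw bytes used.
        nonlocal consumed
        for raw in binary_file:
            consumed += len(raw)
            yield raw.decode(encoding)

    for row in csv.reader(counted_lines()):
        # consumed now covers every physical line that went into this row.
        yield row, (consumed / total_size if total_size else 1.0)


# Hypothetical sample with a newline embedded in a quoted value.
data = b'a,b\n"x\ny",z\n'
for row, fraction in rows_with_progress(io.BytesIO(data), len(data)):
    print("{:.0%}".format(fraction), row)
```

With a real file, total_size could come from os.fstat(csvfile.fileno()).st_size as in the question. Because the tally is taken from the lines the reader actually consumed, it needs no counting pass and is not affected by embedded newlines.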