pythonenumerateline-count

(Python) Counting lines in a huge (>10GB) file as fast as possible


I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
      pass
print i + 1
f.close()

This takes around 3 and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes or less, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB - more than one and a half hour possibly, and we'd like to minimise the time & memory load on the server.

I would also settle for a good approximation/estimation method, but it needs to be about 4 sig fig accurate...

Thank you!


Solution

  • Ignacio's answer is correct, but might fail if you have a 32 bit process.

    But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b: break
            yield b
    
    with open("file", "r") as f:
        print sum(bl.count("\n") for bl in blocks(f))
    

    will do your job.

    Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

    For Python 3, and to make it more robust, for reading files with all kinds of characters:

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b: break
            yield b
    
    with open("file", "r",encoding="utf-8",errors='ignore') as f:
        print (sum(bl.count("\n") for bl in blocks(f)))