
How to efficiently read the last line of a very big gzipped log file?


I'd like to get the last line from a big gzipped log file without having to iterate over all the other lines, because it's a big file.

I have read Print Last Line of File Read In with Python and in particular this answer for big files, but it does not work for gzipped files. Indeed, I tried:

import gzip
import os  # needed for os.SEEK_END / os.SEEK_CUR

# f is the path to the gzipped log file
with gzip.open(f, 'rb') as g:
    g.seek(-2, os.SEEK_END)
    while g.read(1) != b'\n':  # Keep reading backward until you find the next line break
        g.seek(-2, os.SEEK_CUR)
    print(g.readline().decode())

but it already takes more than 80 seconds for a 10 MB compressed / 130 MB decompressed file, on my very standard laptop!

Question: how to seek efficiently to the last line in a gzipped file, with Python?


Side remark: on a file that is not gzipped, this method is very fast, about 1 millisecond for a 130 MB file:

import os, time
t0 = time.time()
with open('test', 'rb') as g:
    g.seek(-2, os.SEEK_END) 
    while g.read(1) != b'\n': 
        g.seek(-2, os.SEEK_CUR) 
    print(g.readline().decode())
print(time.time() - t0)    

Solution

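  • If you have no control over how the gzip file is generated, then there is no way to read the last line of the uncompressed data without decompressing everything before it. A deflate stream can only be decoded from its start, so the time it takes will be O(n), where n is the size of the file. There is no way to make it O(1). The loop in the question is far slower than a single pass because the gzip module only emulates seek(): a backward seek rewinds to the start of the stream and decompresses again up to the target position, and the loop does that once per byte.

    If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random access entry points to enable jumping to the end of the file.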
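
Both cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive implementation. The function names (last_line_sequential, write_members, last_line_indexed) and the lines_per_member parameter are made up for this example. The random-access variant assumes you write the log yourself as concatenated gzip members, each holding whole lines, that you keep the list of member offsets somewhere, and that the log is not empty.

```python
import gzip
import os
import tempfile


def last_line_sequential(path):
    """O(n) fallback: decompress once, front to back, keeping only the latest line."""
    last = b""
    with gzip.open(path, "rb") as g:
        for line in g:
            last = line
    return last.decode().rstrip("\n")


def write_members(path, lines, lines_per_member=1000):
    """Write `lines` as concatenated gzip members and return each member's byte offset.

    Hypothetical helper: every member holds whole lines, so the last line
    always lives entirely inside the last member.
    """
    offsets = []
    with open(path, "wb") as raw:
        for i in range(0, len(lines), lines_per_member):
            offsets.append(raw.tell())  # entry point for random access
            chunk = "".join(line + "\n" for line in lines[i:i + lines_per_member])
            raw.write(gzip.compress(chunk.encode()))
    return offsets


def last_line_indexed(path, offsets):
    """Decompress only the final member, starting at its recorded offset."""
    with open(path, "rb") as raw:
        raw.seek(offsets[-1])  # a cheap seek in the compressed file
        with gzip.GzipFile(fileobj=raw, mode="rb") as g:
            return g.read().splitlines()[-1].decode()


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "demo.log.gz")
    entry_points = write_members(demo, [f"entry {i}" for i in range(5000)])
    print(last_line_sequential(demo))
    print(last_line_indexed(demo, entry_points))
```

The sequential version still decompresses the whole file, but only once, so it avoids the repeated rewinding that makes the loop in the question so slow. The indexed version only touches the last member, at the cost of slightly worse compression (each member starts with an empty dictionary) and of having to store the offsets alongside the file.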