python file seek

What is the most efficient way to get the first and last lines of a text file?


I have a text file which contains a timestamp on each line. My goal is to find the time range. All the times are in order, so the first line is the earliest time and the last line is the latest. I only need the very first and very last lines. What would be the most efficient way to get these lines in Python?

Note: These files are relatively long, about 1-2 million lines each, and I have to do this for several hundred files.


Solution

  • docs for io module

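    import os

    with open(fname, 'rb') as fh:
        first = next(fh).decode()

        # Jump to 1024 bytes before the end; clamp to 0 so small files don't raise OSError.
        fh.seek(max(fh.seek(0, os.SEEK_END) - 1024, 0))
        last = fh.readlines()[-1].decode()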
    

    The tunable value here is 1024, which I chose only as an example. It has to be larger than the last line, otherwise readlines() returns just a fragment of that line. If you have an estimate of the average line length, you could use twice that value.

    If you have no idea whatsoever about the possible upper bound on the line length, the obvious solution is to loop over the whole file, which is simple but reads every line:

    with open(fname) as fh:
        # Iterate to the end; `line` keeps the most recently read line.
        for line in fh:
            pass
    last = line
    

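    For this loop you don't need to bother with the binary flag; you could just use open(fname). The seek-based version does need 'rb', because in Python 3 a text-mode file does not support a nonzero seek relative to the end.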
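
    A possible middle ground between a fixed offset and scanning the whole file is to start with a small window at the end and double it until it contains a complete last line. The sketch below is only an illustration: the function name first_and_last_line and the window parameter are made up for this example, and it assumes UTF-8 (or another ASCII-compatible) encoding with "\n" or "\r\n" line endings. Trailing blank lines are ignored.

```python
import os

def first_and_last_line(fname, window=1024):
    """Return (first, last) line of fname without reading the whole file.

    `window` is an assumed starting guess for the tail size in bytes;
    it is doubled until the tail contains a full last line.
    """
    with open(fname, 'rb') as fh:
        first = fh.readline()
        size = fh.seek(0, os.SEEK_END)  # seek() returns the new position
        while True:
            start = max(size - window, 0)
            fh.seek(start)
            # Drop trailing line endings so the final newline is not
            # mistaken for the separator in front of the last line.
            tail = fh.read().rstrip(b'\r\n')
            cut = tail.rfind(b'\n')
            if cut != -1 or start == 0:
                # Either a separator was found, or the window already
                # covers the whole file (cut == -1 keeps all of tail).
                last = tail[cut + 1:]
                break
            window *= 2
    return first.decode().rstrip('\r\n'), last.decode()
```

    With sorted timestamps, `start_time, end_time = first_and_last_line(fname)` then gives the time range for one file. In the common case only the first line and about one window of bytes at the end are read, so the cost per file does not depend on the number of lines.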

    ETA: Since you have many files to work on, you could take a sample of a couple of dozen files with random.sample and run this code on them to determine the length of the last line, using a deliberately large position shift (say 1 MB). This will help you estimate the value to use for the full run.
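
    The sampling idea could look roughly like the following. This is a sketch under assumptions: the helper names last_line_length and estimate_shift, the sample size of 24, and the factor of 2 for headroom are illustrative choices, not measured values.

```python
import os
import random

def last_line_length(fname, shift=1024 * 1024):
    """Length in bytes of the last line, read from a generous tail window."""
    with open(fname, 'rb') as fh:
        size = fh.seek(0, os.SEEK_END)
        fh.seek(max(size - shift, 0))  # clamp for files smaller than shift
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0

def estimate_shift(fnames, sample_size=24, shift=1024 * 1024):
    """Twice the longest last line found in a random sample of files."""
    fnames = list(fnames)
    sample = random.sample(fnames, min(sample_size, len(fnames)))
    longest = max((last_line_length(f, shift) for f in sample), default=0)
    return 2 * longest
```

    The returned value can then replace 1024 in the seek call for the full run. It is only an estimate from a sample, so a file with an unusually long last line could still exceed it.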