python compression bz2

How to get the time needed for decompressing large bz2 files?


I need to process large bz2 files (~6 GB) in Python, decompressing them line by line with BZ2File.readline(). The problem is that I want to know how much time is needed to process the whole file.

I searched extensively and tried to find the size of the decompressed file, so that I could compute the percentage processed on the fly and, from that, the time remaining. However, it seems impossible to know the decompressed size without decompressing the file first (https://stackoverflow.com/a/12647847/7876675).

Decompressing the whole file up front would take a lot of memory, and a lot of time as well. So, can anybody help me get the remaining processing time on the fly?


Solution

  • You can estimate the time remaining based on the consumption of compressed data, instead of the production of uncompressed data. The result will be about the same if the data is relatively homogeneous. (If it isn't, then neither the input nor the output will give an accurate estimate anyway.)

    You can easily find the size of the compressed file, and use the time spent on the compressed data so far to estimate the time to process the remaining compressed data.

    Here is a simple example of using a BZ2Decompressor object to operate on the input a chunk at a time, showing the read progress (Python 3, getting the file name from the command line):

    # Decompress a bzip2 file, showing progress based on consumed input.
    
    import sys
    import os
    import bz2
    import time
    
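    def proc(data):
        """Decompress and process a piece of a compressed stream.

        Uses the module-level decompressor `dec` and returns the number
        of uncompressed bytes produced by this piece.
        """
        dat = dec.decompress(data)
        got = len(dat)
        if got != 0:    # 0 is common -- waiting for a bzip2 block
            # process dat here
            pass
        return got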
    
    # Get the size of the compressed bzip2 file.
    path = sys.argv[1]
    size = os.path.getsize(path)
    
    # Decompress CHUNK bytes at a time.
    CHUNK = 16384
    totin = 0
    totout = 0
    prev = -1
    dec = bz2.BZ2Decompressor()
    start = time.time()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK), b''):
            # feed chunk to decompressor
            got = proc(chunk)
    
            # handle case of concatenated bz2 streams; loop because one
            # chunk may contain the end of more than one short stream
            while dec.eof:
                rem = dec.unused_data
                dec = bz2.BZ2Decompressor()
                got += proc(rem)
    
            # show progress
            totin += len(chunk)
            totout += got
            if got != 0:    # only if a bzip2 block emitted
                frac = round(1000 * totin / size)
                if frac != prev:
                    left = (size / totin - 1) * (time.time() - start)
                    print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='')
                    prev = frac
    
    # Show the resulting size.
    print(end='\r')
    print(totout, 'uncompressed bytes')
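
Since the question reads the file line by line, the same idea can also be sketched on top of BZ2File rather than a raw BZ2Decompressor. This is a minimal sketch, not a definitive implementation: the function name `iter_lines_with_progress` is made up for illustration, and it assumes that wrapping an already open binary file in BZ2File and calling `tell()` on that underlying file gives a good enough measure of how much compressed input has been consumed. BZ2File reads the compressed data ahead in buffer-sized pieces, so the reported position can run slightly ahead of the line just returned.

```python
import bz2
import os
import time


def iter_lines_with_progress(path):
    """Yield (line, fraction_consumed, seconds_left) for each line.

    Progress is measured on the compressed input: the position of the
    underlying raw file divided by the size of the compressed file.
    """
    size = os.path.getsize(path)
    start = time.monotonic()
    with open(path, 'rb') as raw, bz2.BZ2File(raw) as f:
        for line in f:                  # lines are bytes objects
            consumed = raw.tell()       # compressed bytes read so far
            elapsed = time.monotonic() - start
            frac = consumed / size
            # Same estimate as above: scale the elapsed time by the
            # ratio of remaining to consumed compressed bytes.
            left = (size / consumed - 1) * elapsed
            yield line, frac, left
```

A caller would loop over the generator, process each `line`, and print `frac` and `left` only every so often (for example when the rounded percentage changes), since printing on every line would slow the processing down noticeably.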