I need to process large bz2 files (~6 GB) in Python, decompressing them line by line with BZ2File.readline(). The problem is that I want to know how much time is needed to process the whole file.
I searched a lot and tried to get the actual size of the decompressed file, so that I could compute the percentage processed on the fly, and hence the time remaining. However, it seems impossible to know the decompressed file size without decompressing the file first (https://stackoverflow.com/a/12647847/7876675).
Besides taking loads of memory, decompressing the file takes a lot of time in itself. So, can anybody help me get the remaining processing time on the fly?
You can estimate the time remaining based on the consumption of compressed data, instead of the production of uncompressed data. The result will be about the same if the data is relatively homogeneous. (If it isn't, then neither the input nor the output will give an accurate estimate anyway.)
You can easily find the size of the compressed file, and use the time spent on the compressed data so far to estimate the time to process the remaining compressed data.
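The estimate itself is a simple proportion: if the compressed bytes consumed so far took a given elapsed time, the remaining bytes are assumed to take time at the same rate. A minimal sketch of that arithmetic follows; the helper name `estimate_remaining` and its parameters are hypothetical, chosen only for illustration.

```python
def estimate_remaining(total_bytes, consumed_bytes, elapsed_seconds):
    """Estimate the seconds left, assuming a constant rate of input consumption.

    Hypothetical helper: returns None until some input has been consumed,
    because no rate can be computed before that.
    """
    if consumed_bytes <= 0:
        return None
    remaining_bytes = total_bytes - consumed_bytes
    return elapsed_seconds * remaining_bytes / consumed_bytes
```

For instance, after consuming a quarter of the compressed file in 10 seconds, this predicts roughly 30 more seconds.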
Here is a simple example of using a BZ2Decompressor object to operate on the input a chunk at a time, showing the read progress (Python 3, getting the file name from the command line):
# Decompress a bzip2 file, showing progress based on consumed input.
import sys
import os
import bz2
import time

def proc(input):
    """Decompress and process a piece of a compressed stream"""
    dat = dec.decompress(input)
    got = len(dat)
    if got != 0:    # 0 is common -- waiting for a bzip2 block
        # process dat here
        pass
    return got

# Get the size of the compressed bzip2 file.
path = sys.argv[1]
size = os.path.getsize(path)

# Decompress CHUNK bytes at a time.
CHUNK = 16384
totin = 0
totout = 0
prev = -1
dec = bz2.BZ2Decompressor()
start = time.time()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(CHUNK), b''):
        # feed chunk to decompressor
        got = proc(chunk)

        # handle case of concatenated bz2 streams
        if dec.eof:
            rem = dec.unused_data
            dec = bz2.BZ2Decompressor()
            got += proc(rem)

        # show progress
        totin += len(chunk)
        totout += got
        if got != 0:    # only if a bzip2 block emitted
            frac = round(1000 * totin / size)
            if frac != prev:
                left = (size / totin - 1) * (time.time() - start)
                print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='')
                prev = frac

# Show the resulting size.
print(end='\r')
print(totout, 'uncompressed bytes')
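Since the question reads the file line by line, the same idea can probably be kept with line-based reading: open the compressed file yourself, pass the file object to bz2.open(), and use the raw file's tell() as the count of compressed bytes consumed. The following is a minimal sketch under that assumption; `iter_lines_with_progress` and `show_progress` are hypothetical names, not part of the bz2 module. The raw position advances in buffer-sized jumps, because the decompressor reads its input in blocks, so the estimate is coarse for small files.

```python
import bz2
import os
import time


def iter_lines_with_progress(path):
    """Yield (line, fraction, seconds_left) for each line of a bzip2 file.

    Hypothetical helper: progress is measured on the compressed input.
    """
    size = os.path.getsize(path)
    start = time.monotonic()
    # Open the raw file ourselves so that its read position stays visible.
    with open(path, 'rb') as raw, bz2.open(raw, 'rb') as lines:
        for line in lines:
            consumed = raw.tell()  # compressed bytes read so far
            if consumed == 0 or size == 0:
                yield line, 0.0, None
                continue
            elapsed = time.monotonic() - start
            left = (size / consumed - 1) * elapsed
            yield line, consumed / size, left


def show_progress(path, every=100000):
    """Process each line, printing the progress once per `every` lines."""
    for n, (line, frac, left) in enumerate(iter_lines_with_progress(path)):
        # process line here
        if n % every == 0 and left is not None:
            print(f'\r{100 * frac:.1f}% (~{left:.1f}s left) ', end='')
    print()
```

Lines are yielded as bytes. Reporting only every so many lines keeps the cost of printing small compared with the decompression itself.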