I'm working with a huge .7z file that I need to process line by line.
First I tried py7zr, but it only works by first decompressing the whole file into an object. This runs out of memory.
Then I tried libarchive, which is able to read block by block, but there's no straightforward way of splitting these binary blocks into lines.
What can I do?
Related questions I researched first:
I'm looking for ways to improve the temporary solution I built myself - posted as an answer here. Thanks!
This solution goes through all the blocks returned by get_blocks(). If a block doesn't end in \n, its last line is incomplete, so we keep the remaining text and yield it together with the start of the next block.
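import libarchive

def process(my_file):
    with libarchive.file_reader(my_file) as archive:
        for entry in archive:
            data = ''  # incomplete line carried over from the previous block
            for block in entry.get_blocks():
                data += block.decode('ISO-8859-1')
                # Split on '\n' only. str.splitlines() would also break on
                # characters such as '\x85', which ISO-8859-1 data may contain.
                lines = data.split('\n')
                # The last piece is an incomplete line ('' if the block
                # ended exactly on a newline); keep it for the next block.
                data = lines.pop()
                for line in lines:
                    # Drop the '\r' left over from Windows-style line endings.
                    yield ({'l': line.rstrip('\r')},)
            # Flush the final line if the entry doesn't end in '\n'.
            if data:
                yield ({'l': data.rstrip('\r')},)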
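
The carry-over logic can also be pulled out into a small generator that knows nothing about libarchive, which makes it easy to test with plain lists of bytes. The sketch below is only an illustration: iter_lines is a hypothetical helper name, not part of libarchive or py7zr. It buffers bytes rather than decoded text, so a multi-byte character that happens to straddle two blocks is decoded only once the whole line is available (this matters for encodings such as UTF-8, not for ISO-8859-1).

```python
def iter_lines(blocks, encoding='ISO-8859-1'):
    """Yield decoded lines from an iterable of bytes blocks.

    Hypothetical helper for illustration; lines are split on b'\\n' only.
    """
    pending = b''  # bytes of a line that is not complete yet
    for block in blocks:
        pending += block
        # Everything before the last b'\n' is complete; the rest is carried over.
        *complete, pending = pending.split(b'\n')
        for raw in complete:
            yield raw.decode(encoding)
    if pending:
        # The input did not end with a newline: emit the final partial line.
        yield pending.decode(encoding)


if __name__ == '__main__':
    # Blocks deliberately cut lines in the middle.
    for line in iter_lines([b'first li', b'ne\nsecond', b' line\nlast']):
        print(line)
```

Assuming entry.get_blocks() yields bytes objects, as it does in the code above, it should be possible to pass it straight in: for line in iter_lines(entry.get_blocks()).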