python string 7zip in-memory libarchive

Read a .7z file in memory with Python and process each line as a stream


I'm working with a huge .7z file that I need to process line by line.

First I tried py7zr, but it only works by decompressing the whole file into an object first, and that runs out of memory.

Then I tried libarchive, which can read the archive block by block, but there's no straightforward way of splitting these binary blocks into lines.

What can I do?

Related questions I researched first:

I'm looking for ways to improve the temporary solution I built myself - posted as an answer here. Thanks!


Solution

  • This solution iterates over every block returned by get_blocks(). If a block doesn't end in \n, the trailing partial line is kept and yielded together with the next block.

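    import libarchive
    
    def process(my_file):
        with libarchive.file_reader(my_file) as e:
            for entry in e:
                data = ''  # partial line carried over between blocks
                for block in entry.get_blocks():
                    data += block.decode('ISO-8859-1')
                    # Split on '\n' only. splitlines() also breaks on '\r',
                    # '\x0b', '\x0c', '\x85', etc., which doesn't match the
                    # '\n' check and can merge or split lines incorrectly.
                    # (Strip a trailing '\r' per line if the input uses CRLF.)
                    lines = data.split('\n')
                    # The last piece is the incomplete line ('' if the block
                    # ended in '\n'); keep it for the next block.
                    data = lines.pop()
                    for line in lines:
                        yield ({'l': line},)
                # Don't drop the final line when the entry lacks a trailing '\n'.
                if data:
                    yield ({'l': data},)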
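
  • A possible refinement, as a rough sketch rather than a tested drop-in: wrap the block iterator in a small file-like object and let the standard library's io.TextIOWrapper do the buffering, decoding and line splitting. The names BlockStream and iter_lines are made up for this example and are not part of libarchive. The sketch assumes each block is a bytes object and that lines end in \n.

```python
import io


class BlockStream(io.RawIOBase):
    """Read-only binary stream backed by an iterable of bytes blocks.

    Hypothetical helper for illustration; not part of libarchive.
    """

    def __init__(self, blocks):
        self._blocks = iter(blocks)
        self._pending = b''  # bytes from the current block not yet consumed

    def readable(self):
        return True

    def readinto(self, buf):
        # Fetch the next non-empty block; returning 0 signals end of stream.
        while not self._pending:
            try:
                self._pending = next(self._blocks)
            except StopIteration:
                return 0
        n = min(len(buf), len(self._pending))
        buf[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


def iter_lines(blocks, encoding='ISO-8859-1'):
    """Yield decoded lines (without the trailing '\\n') from bytes blocks."""
    raw = BlockStream(blocks)
    # newline='\n' makes '\n' the only line terminator and disables translation.
    text = io.TextIOWrapper(io.BufferedReader(raw), encoding=encoding, newline='\n')
    for line in text:
        yield line.rstrip('\n')


# Simulated blocks: a line split across blocks, an empty block,
# and a final line without a trailing newline.
for line in iter_lines([b'ab', b'c\nde\n', b'', b'f']):
    print(line)
```

    With libarchive, the inner loop would then become something like `for line in iter_lines(entry.get_blocks()): yield ({'l': line},)`. Only one buffer's worth of data is held in memory at a time, and a multi-byte encoding such as UTF-8 should also work, because the wrapper decodes incrementally across block boundaries.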