[SOLVED] Reading a multi-part 7z file using Python is failing due to memory issues

I am reading a multi-volume 7z archive in a loop with the following code:

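import py7zr
import multivolumefile

# ARCHIVE_PATH, PASSWORD and filters are defined elsewhere
zip_path = f"{ARCHIVE_PATH}/test.7z"

# multivolumefile presents the split volumes as a single file object
with multivolumefile.open(zip_path, mode='rb') as multizip_handler:
    with py7zr.SevenZipFile(multizip_handler, 'r', password=PASSWORD, filters=filters) as zip_handler:
        # read() returns a dict mapping each file name to an in-memory buffer,
        # so all of the decompressed content is held in RAM at this point
        for fname, fcontent in zip_handler.read(targets=None).items():
            pass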

The archive is relatively large (73 parts with a total size of 700 MB). I have noticed that the memory footprint is quite high (even without keeping any variable content such as fname or fcontent in memory). This loop works, but if I intentionally fill the memory with a command such as head -c 7G /dev/zero | tail, the loop gives me a CRC error (even though the archive tests fine with the 7z command). The loop is quite simple and uses only library functions, so I cannot make it any lighter than it is.

EDIT: to be more precise:

For some archives, the loop fails completely.
For others, the loop works, and I can tell it is a memory issue because filling the memory makes the loop fail (with the same code and the same archive). The memory was filled in a way that still left enough free space (say 1 GB).

So my guess is that one of the two libraries multivolumefile or py7zr is internally consuming a lot of memory.
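
To turn that guess into a number, the peak memory of the loop can be measured. The following is only a minimal sketch using the standard-library tracemalloc module. The helper name peak_python_memory is hypothetical, and tracemalloc only sees allocations made through Python's own allocators, so memory allocated directly by C libraries may not show up.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, peak_bytes).

    peak_bytes is the peak size of the memory blocks traced by
    tracemalloc while func was running.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    # Stand-in workload: allocate a 10 MiB buffer.
    # In practice, wrap the archive-reading loop in a function and pass it here.
    _, peak = peak_python_memory(bytearray, 10 * 1024 * 1024)
    print(f"peak traced memory: {peak / 1024 / 1024:.1f} MiB")
```

Comparing this figure with what the operating system reports for the process (for example in top) gives a rough idea of how much of the footprint comes from Python-level buffers.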

Is there a way to reduce the memory footprint so that reading a multi-part archive always succeeds, regardless of the size of the archive or of the files inside it?

Solution

After many tests, this looks like a bug. A bug report has been submitted: https://github.com/miurahr/py7zr/issues/575
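
Independently of the bug report, one way to avoid holding every decompressed file in RAM at once is to extract to a temporary directory with extractall() and then read the files back from disk in chunks. This is only a sketch under assumptions: the helper names (extract_to_disk, iter_extracted, iter_chunks, process_archive) are hypothetical, and I have not verified that it avoids the CRC error described above.

```python
import tempfile
from pathlib import Path


def extract_to_disk(zip_path, dest, password=None):
    """Extract a multi-volume 7z archive into the directory dest."""
    # Imported lazily so the disk helpers below work without these libraries
    import multivolumefile
    import py7zr

    with multivolumefile.open(zip_path, mode="rb") as volumes:
        with py7zr.SevenZipFile(volumes, "r", password=password) as archive:
            # Writes the files under dest instead of returning them in memory
            archive.extractall(path=dest)


def iter_extracted(dest):
    """Yield (relative_name, path) for every regular file under dest."""
    root = Path(dest)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path


def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield the content of path in pieces of at most chunk_size bytes."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk


def process_archive(zip_path, handle_chunk, password=None):
    """Extract to a temporary directory, then stream each file to handle_chunk."""
    with tempfile.TemporaryDirectory() as tmp:
        extract_to_disk(zip_path, tmp, password=password)
        for fname, path in iter_extracted(tmp):
            for chunk in iter_chunks(path):
                handle_chunk(fname, chunk)
```

With this layout, the Python-side memory needed to process the extracted files is bounded by chunk_size rather than by the size of the largest file, at the cost of temporary disk space. Whatever py7zr buffers internally during extraction is outside the control of this sketch.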