Tags: python, python-3.x, python-zipfile

Large Zip Files with Zipfile Module Python


I have never used the zipfile module before. I have a directory that contains thousands of zip files I need to process. These files can be up to 6 GB each. I have looked through some documentation, but much of it is unclear about the best way to read large zip files without extracting them.

I stumbled upon this: Read a large zipped text file line by line in python

So in my solution I tried to emulate it and use it the way I would read a normal text file with the with open(...) statement:

with open(odfslogp_obj, 'rb', buffering=102400) as odfslog

So I wrote the following based off the answer from that link:

for odfslogp_obj in odfslogs_plist:
    with zipfile.ZipFile(odfslogp_obj, mode='r') as z:
        with z.open(buffering=102400) as f:
            for line in f:
                print(line)

But this gives me an "unexpected keyword" error for z.open().

My question: is there documentation that explains which keyword arguments z.open() accepts? I only found documentation for zipfile.ZipFile() itself.

I wanna make sure my code isn't using up too much memory while processing these files line by line.

odfslogp_obj is a Path object btw

When I take off the buffering and just have z.open(), I get an error saying: TypeError: open() missing 1 required positional argument: 'name'


Solution

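  • Once you've opened the zipfile, you still need to open the individual files it contains. That is the second z.open you had problems with. It's not the builtin Python open and it doesn't have a "buffering" parameter. Its signature is ZipFile.open(name, mode='r', pwd=None, *, force_zip64=False), which is also why calling it with no arguments fails with the missing 'name' error. See ZipFile.open in the zipfile documentation.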

    Once the zipfile is opened you can enumerate its files and open them in turn. ZipFile.open opens in binary mode, so each line comes back as bytes rather than str; that may be a separate problem, depending on what you want to do with the file.

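    import zipfile

    for odfslogp_obj in odfslogs_plist:
        with zipfile.ZipFile(odfslogp_obj, mode='r') as z:
            # Each archive can hold several files; open them one at a time.
            for name in z.namelist():
                with z.open(name) as f:
                    # f yields bytes lines, decompressed as they are read.
                    for line in f:
                        print(line)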
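
If the files inside the archives are text, one way to deal with the binary-mode issue is to wrap the object returned by ZipFile.open in io.TextIOWrapper. The following is only a sketch under some assumptions: the helper name iter_zip_lines is made up for illustration, and the members are assumed to be UTF-8 text (undecodable bytes are replaced rather than raising). Because the member is decompressed as it is read, iterating this way should not load a whole 6 GB file into memory at once.

```python
import io
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_zip_lines(zip_path: Path, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (member_name, line) pairs for every file inside one archive.

    Hypothetical helper; the encoding default is an assumption.
    """
    with zipfile.ZipFile(zip_path, mode="r") as z:
        for info in z.infolist():
            if info.is_dir():
                continue  # directory entries have no content to read
            # z.open() returns a binary file-like object that decompresses
            # on demand as it is read.
            with z.open(info) as raw:
                # TextIOWrapper decodes the bytes to str line by line.
                with io.TextIOWrapper(raw, encoding=encoding, errors="replace") as text:
                    for line in text:
                        yield info.filename, line.rstrip("\n")
```

Inside the loop over odfslogs_plist this could be used as `for name, line in iter_zip_lines(odfslogp_obj): print(name, line)`. If the logs use a different encoding, pass it through the encoding argument.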