python python-zipfile

ZipFile.infolist() returns entries in a different order than the zip file appears to have. Is there a workaround?


I am trying to extract the files inside a zip file and rename them from 1 to n while extracting. But when the actual filenames also run from 1 to n, ZipFile.infolist() returns them in the wrong order.

This is how I am trying to get the result:

with ZipFile(file) as zf:
    for i, file in enumerate(zf.infolist(), 1):
        file.filename = f'{i}.{file.filename.split(".")[-1]}'
        zf.extract(file, file_path)

And this is what the actual file ordering looks like: (screenshot)

When I debug the code, ZipFile.infolist() returns a list of ZipInfo objects like this: (screenshot)

As you can see from the images, the actual ordering is 1, 2, 3, 4, 5, 6, 7, 8 ... n, but ZipFile.infolist() returns it as 1, 10, 11, 12, 13 ... n.

Am I doing something wrong? Or is there a workaround? In the worst case I could rename the files in the zip file to 01, 02, 03, 04 ... n, but that is an unreliable solution.


Solution

  • The program you are using to display the contents of your zip file sorts the filenames numerically before it shows them to you. Python's zipfile module lists and extracts the entries in the order they are stored in the zip file.

    Here is a worked example that shows the issue.

    First create some files

    $ touch 1.jpg 2.jpg 3.jpg 4.jpg 7.jpg 8.jpg 10.jpg 21.jpg 100.jpg 205.jpg
    

    Add them to a zip file in a random order

    $ zip test.zip 10.jpg 2.jpg 1.jpg 100.jpg 21.jpg 8.jpg 205.jpg 7.jpg 3.jpg
      adding: 10.jpg (stored 0%)
      adding: 2.jpg (stored 0%)
      adding: 1.jpg (stored 0%)
      adding: 100.jpg (stored 0%)
      adding: 21.jpg (stored 0%)
      adding: 8.jpg (stored 0%)
      adding: 205.jpg (stored 0%)
      adding: 7.jpg (stored 0%)
      adding: 3.jpg (stored 0%)
    

    Check what unzip thinks is in the file

    $ unzip -l test.zip 
    Archive:  test.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
            0  2023-10-01 15:44   10.jpg
            0  2023-10-01 15:44   2.jpg
            0  2023-10-01 15:44   1.jpg
            0  2023-10-01 15:44   100.jpg
            0  2023-10-01 15:44   21.jpg
            0  2023-10-01 15:44   8.jpg
            0  2023-10-01 15:44   205.jpg
            0  2023-10-01 15:44   7.jpg
            0  2023-10-01 15:44   3.jpg
    ---------                     -------
            0                     9 files
    

    It displays them in the order they were added.

    Now print the contents with Python

    import zipfile
    
    zipfilename = "test.zip"
    with zipfile.ZipFile(zipfilename) as zf:
        # namelist() returns the names in the order they are stored in the archive
        for file in zf.namelist():
            print(file)
    

    The code outputs this

    $ python try.py 
    10.jpg
    2.jpg
    1.jpg
    100.jpg
    21.jpg
    8.jpg
    205.jpg
    7.jpg
    3.jpg
    

    That matches the file insertion order
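
    The same behaviour can be reproduced without the shell. The following is a minimal sketch that builds an archive in memory and reads the names back. The helper name names_in_stored_order is made up for this illustration and is not part of zipfile.

```python
import io
import zipfile


def names_in_stored_order(names):
    """Write empty members in the given order, then list the archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")  # empty file, like `touch`
    with zipfile.ZipFile(buf) as zf:
        return zf.namelist()


insertion_order = ["10.jpg", "2.jpg", "1.jpg", "100.jpg"]
print(names_in_stored_order(insertion_order))
# prints ['10.jpg', '2.jpg', '1.jpg', '100.jpg']
```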

    You can work around this by sorting the contents of the zip file yourself. The key point about your files is that each name is a number followed by an extension.

    The Python code can use the sorted function to sort the files by the numeric part of the filename

    import zipfile
    from pathlib import Path
    
    print()
    zipfilename = "test.zip"
    with zipfile.ZipFile(zipfilename) as zf:
        # sort by the integer value of the name without its extension
        for file in sorted(zf.namelist(), key=lambda x: int(Path(x).stem)):
            print(file)
    

    This outputs the files in numeric order

    $ python try.py 
    
    1.jpg
    2.jpg
    3.jpg
    7.jpg
    8.jpg
    10.jpg
    21.jpg
    100.jpg
    205.jpg
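
    For contrast, here is a small sketch of what happens when the same names are sorted without a key. A plain sorted() call compares the names as strings, character by character, which gives the 1, 10, 100, 2 ... pattern. The sample list is made up for the illustration.

```python
from pathlib import Path

names = ["10.jpg", "2.jpg", "1.jpg", "100.jpg", "21.jpg"]

# Default sort: lexicographic string comparison
as_strings = sorted(names)

# Keyed sort: compare the integer value of each stem
as_numbers = sorted(names, key=lambda name: int(Path(name).stem))

print(as_strings)  # ['1.jpg', '10.jpg', '100.jpg', '2.jpg', '21.jpg']
print(as_numbers)  # ['1.jpg', '2.jpg', '10.jpg', '21.jpg', '100.jpg']
```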
    

    Let's unpick the line with the sorted function

    for file in sorted(zf.namelist(), key=lambda x: int(Path(x).stem)):
    

    The sorted function is given two parameters:

    1. the list of filenames to sort, obtained from zf.namelist()

    2. a function that works out the key to be used in sorting, lambda x: int(Path(x).stem).

      The call to Path(x).stem takes the full filename (e.g. 205.jpg) and returns the filename without the extension (e.g. 205).

      The int converts the string 205 into the integer value 205. That value is then returned to sorted and allows the filenames to be sorted numerically.
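
    Applied to the original goal of renaming the files from 1 to n while extracting, the same sort key can be used on infolist(). This is a rough sketch, not a drop-in solution: the function name extract_renumbered and its arguments are invented for the example, it assumes every member name has a purely numeric stem (int() raises ValueError otherwise), and it copies the data with zf.open() instead of modifying ZipInfo.filename.

```python
import shutil
import zipfile
from pathlib import Path


def extract_renumbered(zip_path, dest_dir):
    """Extract members in numeric filename order, renaming them 1..n.

    Returns a list of (original name, new name) pairs.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        # Ignore directory entries, then sort by the integer in the stem
        members = [m for m in zf.infolist() if not m.is_dir()]
        members.sort(key=lambda m: int(Path(m.filename).stem))
        for i, member in enumerate(members, 1):
            # Keep the original extension, replace the name with the index
            target = dest / f"{i}{Path(member.filename).suffix}"
            with zf.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append((member.filename, target.name))
    return written
```

    With the test.zip built earlier, extract_renumbered("test.zip", "out") would write nine files named 1.jpg to 9.jpg, and the member 10.jpg would end up as 6.jpg.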