python zip gzip unzip tmp

Unzip a file without creating temporary files


I download a zip file from AWS S3 and unzip it. After unzipping, all files are saved in the /tmp/ folder.

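import zipfile

import boto3

s3 = boto3.client('s3')

# Download the archive from the bucket to local disk
s3.download_file('testunzipping','DataPump_10000838.zip','/tmp/DataPump_10000838.zip')

# Extract everything and record the member names
with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp/')
    lstNEW = zip_ref.namelist()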

The value of lstNEW is something like this:

['DataPump_10000838/', '__MACOSX/._DataPump_10000838', 'DataPump_10000838/DockBooking', '__MACOSX/DataPump_10000838/._DockBooking', 'DataPump_10000838/LoadEquipment', '__MACOSX/DataPump_10000838/._LoadEquipment', ....]

LoadEquipment and DockBooking are real files, but the rest are not. Is it possible to unzip the archive without creating those temporary files? Or is it possible to filter out the real files? Later, I need to take the correct files and gzip them, with names of this form:

$item_$unixepochtimestamp.csv.gz

Do I use the compress function?


Solution

  • To extract only certain files, pass a list of member names to the members parameter of extractall:

    with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
        lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
        zip_ref.extractall('/tmp/', members=lstNEW)
    

    The __MACOSX/ entries are not temporary files. They are how macOS stores resource forks in zip archives, a format that does not normally support them.
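
The question also asks about gzipping the real files afterwards. The following is a minimal sketch of one way to do that, not a definitive implementation. It assumes that the "item" in the requested file name is the member's base name (for example DockBooking) and that one Unix timestamp is shared by all output files. The helpers is_real_file and gzip_members are hypothetical names introduced here for illustration. Directory entries such as 'DataPump_10000838/' are skipped along with the __MACOSX/ metadata, and each member is streamed straight from the archive into a .csv.gz file, so no uncompressed copy is written to disk.

```python
import gzip
import shutil
import time
import zipfile
from pathlib import Path, PurePosixPath


def is_real_file(info: zipfile.ZipInfo) -> bool:
    """Return True for regular files outside the __MACOSX/ metadata tree."""
    return not info.is_dir() and not info.filename.startswith("__MACOSX/")


def gzip_members(zip_path, out_dir, timestamp=None):
    """Stream each real member of the archive into its own .csv.gz file.

    Hypothetical helper: the naming scheme <item>_<timestamp>.csv.gz is an
    assumption based on the pattern given in the question.
    """
    timestamp = int(time.time()) if timestamp is None else timestamp
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for info in zip_ref.infolist():
            if not is_real_file(info):
                continue
            # Member names inside a zip always use forward slashes
            item = PurePosixPath(info.filename).name
            target = out_dir / f"{item}_{timestamp}.csv.gz"
            # Copy the decompressed member directly into the gzip stream
            with zip_ref.open(info) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

It could be called as gzip_members('/tmp/DataPump_10000838.zip', '/tmp/gz'). On the compress question: gzip.compress works on a bytes object that is already fully in memory, whereas gzip.open combined with shutil.copyfileobj streams the data in chunks, which is likely the safer choice for large members.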