python shutil python-os python-zipfile

How can I extract and rename specific files from daily compressed zip folders using Python?


Let me preface this by saying that I have never worked with Python before, but I have spent the past several days researching and watching videos. Although I've found solutions that help with parts of this, I'm having a hard time filling in the gaps since none of them match my exact situation. I'm not just looking for code; I'm hoping to understand and learn, so please point me to any resources that would help.

I receive daily compressed zip folders named like "v1_csv_2022_06_02_2023_etc.zip". Each zip contains a sub-folder named "export", and that folder holds over 30 CSVs. I only care about four of those CSVs, identified by keywords in the filename. Ex: only pull CSVs whose names contain the keyword "sales" or "activity".

The second component is that when I unzip all the files, they of course overwrite each other, so I only see the current day's data. Based on my research, I assume this is because the filenames are the same in every daily zip folder. So I am looking for a way to rename these files and then append them all together. It would be nice to add the zip folder's naming convention ("2022_06_2023") to the filename, ex: "2022_06_02_2020_sales.csv", with the end state being one master sales CSV.

Folder structure: Users --> username --> documents --> sales & activities --> zip folder (daily zip folders compressed with date in file name) --> export folder --> 30+ csv files

import os
import zipfile

# Path to the directory containing the zip files
zip_directory = '/path/to/zip/files'

# Iterate over all files in the zip directory
for filename in os.listdir(zip_directory):
    if filename.endswith('.zip'):
        # Construct the full path to the zip file
        zip_path = os.path.join(zip_directory, filename)

        # Open the zip file
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            # Extract all files to a folder named 'export'
            zip_ref.extractall(os.path.join(zip_directory, 'export'))

        # Get the base name of the zip file without the extension
        zip_base_name = os.path.splitext(filename)[0]

        # Iterate over all extracted files in the 'export' folder
        export_directory = os.path.join(zip_directory, 'export')
        for extracted_file in os.listdir(export_directory):
            # Construct the full path to the extracted file
            extracted_file_path = os.path.join(export_directory, extracted_file)

            # Rename the extracted file with the zip file's base name
            new_filename = f'{zip_base_name}_{extracted_file}'
            new_file_path = os.path.join(export_directory, new_filename)
            os.rename(extracted_file_path, new_file_path)

I've tried variations of this, but to be honest I've tried a lot of code over the past few days, so I'm not sure what's relevant to include.


Solution

  • The problem is that when you extract, you aren't creating a separate folder for each zip file, so every archive unpacks into the same 'export' folder and overwrites the previous day's files. A few things to help:

    import zipfile
    from pathlib import Path
    
    # Path to the directory containing the zip files
    zip_directory = Path('/path/to/zip/files')
    
    # Make an exports directory if you want, exist_ok so it
    # doesn't throw errors trying to create it twice, and parents
    # so you can make some new parent dirs if you want
    unload_dir = Path('/path/to/exports')
    # mkdir() returns None, so call it separately from the assignment
    unload_dir.mkdir(exist_ok=True, parents=True)
    
    # Find all zip files
    files = zip_directory.glob('*.zip')
    
    for file in files:
        # take the file name with no extension
        exclusive_name = file.stem
    
        # mark the new subdirectory
        subdir = unload_dir / exclusive_name
        
        # Make new folder
        subdir.mkdir(exist_ok=True)
    
        with zipfile.ZipFile(file, 'r') as zf:
            # extract everything in the archive into the subdir
            zf.extractall(subdir)
    

    Now each subdirectory carries your naming convention: a zip extracts to /path/to/exports/v1_csv_2022_06_02_2023_etc/, with the CSVs landing under its "export" sub-folder if the archive contains one. Files are overwritten only if you extract the same zip file again.
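
The question also asks about pulling only certain CSVs and ending up with one master file, which the snippet above doesn't cover. Here is a minimal sketch of one way to do that with the standard library. The function names `extract_matching` and `combine_csvs`, the `KEYWORDS` tuple, and the output file names are hypothetical, and the sketch assumes every matching CSV has the same header row and columns.

```python
import csv
import shutil
import zipfile
from pathlib import Path

# Keywords taken from the question; adjust to the four files you need
KEYWORDS = ("sales", "activity")


def extract_matching(zip_directory, unload_dir, keywords=KEYWORDS):
    """Copy matching CSVs out of every zip, prefixed with the zip's name."""
    zip_directory = Path(zip_directory)
    unload_dir = Path(unload_dir)
    unload_dir.mkdir(exist_ok=True, parents=True)

    extracted = []
    for zip_path in sorted(zip_directory.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                # Keep only the file name, dropping the 'export/' folder part
                name = Path(member).name
                if not name.lower().endswith(".csv"):
                    continue
                if not any(key in name.lower() for key in keywords):
                    continue
                # Prefix with the zip's stem so daily files never collide
                target = unload_dir / f"{zip_path.stem}_{name}"
                with zf.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(target)
    return extracted


def combine_csvs(csv_paths, master_path):
    """Append CSVs into one file, writing the header row only once.

    Assumes all input files share the same header and column order.
    """
    with open(master_path, "w", newline="") as out:
        writer = csv.writer(out)
        header_written = False
        for path in csv_paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)
```

To use it, call `extract_matching('/path/to/zip/files', '/path/to/exports')`, filter the returned paths for the ones whose name contains "sales", and pass that list to `combine_csvs` along with the path of the master file. Reading each member with `zf.open` rather than `extractall` lets you choose the output name yourself, so nothing is extracted that you don't need and nothing gets overwritten.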