
Python: Extract gz files and honor original filenames and file extensions


I have a folder containing many .gz files. The files compressed inside them have various extensions: some are .txt, some are .csv, some are .xml, and some have other extensions.

E.g. the gz files in the folder (with the original file compressed inside each one shown in parentheses) will be

C:\Xiang\filename1.txt.gz (filename1.txt)
C:\Xiang\filename2.txt.gz (filename2.txt)
C:\Xiang\some_prefix_filename3.txt.gz (filename3.txt)
...
C:\Xiang\xmlfile1.xml_some_postfix.gz   (xmlfile1.xml)
C:\Xiang\yyyymmddxmlfile2.xml.gz       (xmlfile2.xml)
...
C:\Xiang\someotherName.csv.gz            (someotherName.csv)
C:\Xiang\possiblePrefixsomeotherfile1.someotherExtension.gz (someotherfile1.someotherExtension)
C:\Xiang\someotherfile2.someotherExtensionPossiblePostfix.gz (someotherfile2.someotherExtension)
...

How could I simply unzip all the .gz files under the folder C:\Xiang in Python on Windows 10 and save the results into the folder C:\UnZipGz, honoring the original filenames, so that the result is as follows:

C:\UnZipGz\filename1.txt
C:\UnZipGz\filename2.txt
C:\UnZipGz\filename3.txt
...
C:\UnZipGz\xmlfile1.xml
C:\UnZipGz\xmlfile2.xml
...
C:\UnZipGz\someotherName.csv
C:\UnZipGz\someotherfile1.someotherExtension
C:\UnZipGz\someotherfile2.someotherExtension
...

Generally, the gz file names are consistent with the names of the files inside, but this is not always the case. Some of the .gz files were renamed at some point in the past, so a gz file's name does not necessarily match the name of the file it contains.

How can I extract all the gz files and keep the original filenames and extensions? I.e., regardless of how the gz files are named, when extracting them, save the unzipped files only under their original names, in the form

filename.fileExtension

into the C:\UnZipGz folder.


Solution

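    import gzip
    import os
    import shutil


    # Raw strings: in a normal string literal '\U' starts a Unicode escape,
    # so 'C:\UnZipGz' is a syntax error in Python 3.
    INPUT_DIRECTORY = r'C:\Xiang'
    OUTPUT_DIRECTORY = r'C:\UnZipGz'
    GZIP_EXTENSION = '.gz'


    def make_output_path(output_directory, zipped_name):
        """ Generate a path to write the unzipped file to.

        :param str output_directory: Directory to place the file in
        :param str zipped_name: Name of the zipped file
        :return str:
        """
        name_without_gzip_extension = zipped_name[:-len(GZIP_EXTENSION)]
        return os.path.join(output_directory, name_without_gzip_extension)


    # Make sure the output folder exists before writing into it.
    os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)

    for entry in os.scandir(INPUT_DIRECTORY):
        if not entry.name.lower().endswith(GZIP_EXTENSION):
            continue

        output_path = make_output_path(OUTPUT_DIRECTORY, entry.name)

        print('Decompressing', entry.path, 'to', output_path)

        # Use a separate name for the opened archive so the loop variable is
        # not shadowed, and stream the data instead of reading it all at once.
        with gzip.open(entry.path, 'rb') as zipped_file:
            with open(output_path, 'wb') as output_file:
                shutil.copyfileobj(zipped_file, output_file)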
    

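    Explanation:

    1. Iterate through all files in the folder that have the relevant extension.
    2. Build each output path in the new directory by stripping the gzip extension from the archive's name.
    3. Open the archive and write its decompressed contents to the new path.

    Note that this names each output file after the archive, not after the file inside it. A gzip header can optionally store the original file name; to retrieve it, you can use gzinfo: https://github.com/PierreSelim/gzinfo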

    >>> import gzinfo
    >>> info = gzinfo.read_gz_info('bar.txt.gz')
    >>> info.fname
    'foo.txt'
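
    If installing a third-party package is not an option, the stored name can also be read with the standard library alone. The following is a minimal sketch, assuming the archives follow the gzip header layout from RFC 1952 (a 10-byte fixed header, an optional extra field, then an optional zero-terminated file name). The function names read_original_name and extract_with_original_name are hypothetical, chosen only for this illustration. When an archive carries no stored name, the sketch falls back to the archive's own name without the .gz suffix.

```python
import gzip
import os
import shutil
import struct

FEXTRA = 0x04  # header flag: an optional "extra" field is present
FNAME = 0x08   # header flag: the original file name is present


def read_original_name(gz_path):
    """Return the file name stored in the gzip header, or None if absent."""
    with open(gz_path, 'rb') as f:
        header = f.read(10)  # magic (2), method, flags, mtime (4), xfl, os
        if len(header) < 10 or header[:2] != b'\x1f\x8b':
            raise ValueError('not a gzip file: %s' % gz_path)
        flags = header[3]
        if flags & FEXTRA:
            # Skip the extra field: 2-byte little-endian length, then payload.
            (extra_len,) = struct.unpack('<H', f.read(2))
            f.seek(extra_len, os.SEEK_CUR)
        if not flags & FNAME:
            return None
        # The name is a zero-terminated Latin-1 string.
        name = bytearray()
        while True:
            byte = f.read(1)
            if not byte or byte == b'\x00':
                break
            name += byte
        return name.decode('latin-1') or None


def extract_with_original_name(gz_path, output_directory):
    """Decompress gz_path into output_directory under its stored name."""
    stored = read_original_name(gz_path)
    # Keep only the base name so a stored path cannot escape the output folder.
    name = os.path.basename(stored.replace('\\', '/')) if stored else None
    if not name:
        # Assumption: with no stored name, fall back to the archive's name.
        base = os.path.basename(gz_path)
        name = base[:-3] if base.lower().endswith('.gz') else base + '.out'
    output_path = os.path.join(output_directory, name)
    with gzip.open(gz_path, 'rb') as src, open(output_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return output_path
```

    To process the whole folder, call extract_with_original_name(entry.path, r'C:\UnZipGz') for each .gz entry returned by os.scandir(r'C:\Xiang'). Two archives may store the same original name, in which case the later one overwrites the earlier one; checking os.path.exists on the output path first is one way to detect that.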
    

    References to extract original file name: