Tags: python, compression, archive, python-magic

Differentiating between compressed .gz files and archived tar.gz files properly?


What is the proper way to differentiate between a plain file compressed with gzip or bzip2 (e.g. .gz) and a tarball compressed with gzip or bzip2 (e.g. .tar.gz)? Identification by file extension is not reliable, as files may end up renamed.

Now on the command line I am able to do something like this:

bzip2 -dc test.tar.bz2 | head | file -

So I attempted something similar in python with the following function:

import magic

def get_magic(self, store_file, buffer=False, look_deeper=False):
    # See what we're indexing. With look_deeper=True, libmagic is asked
    # to look inside compressed data as well.
    m = magic.Magic(mime=True, uncompress=look_deeper)

    if buffer:
        # store_file holds raw bytes rather than a path
        return m.from_buffer(store_file)

    # store_file is a path on disk
    return m.from_file(store_file)

Then when trying to read a compressed tarball I'll pass in the buffer from elsewhere via:

    file_buffer = open(file_name, "rb").read(8096)
    archive_check = self.get_magic(file_buffer, True, True)

Unfortunately, this is problematic with the uncompress flag, because python-magic appears to expect the entire compressed file even though I only want it to inspect the buffer. I end up with the exception:

bzip2 ERROR: Compressed file ends unexpectedly

Since the files I am looking at can be anywhere from 2 MB to 20 GB in size, this becomes rather problematic. I don't want to read the entire file.

Can it be hacked by chopping the end off the compressed file and appending it to the buffer? Or is it better to give up on having python-magic uncompress the file and instead decompress it myself before passing in a buffer to identify, via:

    file_buffer = bz2.BZ2File(file_name).read(8096)

Is there a better way?


Solution

  • It is very likely a tar file if the uncompressed data at offset 257 is "ustar", or if the uncompressed data in its entirety is 1024 zero bytes (an empty tar file).

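    You can read just the first 1024 bytes of the uncompressed data using z = zlib.decompressobj(16 + zlib.MAX_WBITS) for gzip (the extra 16 tells zlib to expect a gzip header rather than a raw zlib one) or z = bz2.BZ2Decompressor() for bzip2, feeding chunks of the compressed file to z.decompress() until 1024 bytes have come out.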
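The approach above could be sketched roughly as follows, assuming Python 3. The helper names read_uncompressed_prefix and looks_like_tar and the 8192-byte chunk size are made up for illustration and are not part of any library. The compression type is picked from the magic bytes at the start of the file (1f 8b for gzip, "BZh" for bzip2), and only as much of the file is read as is needed to produce 1024 uncompressed bytes.

```python
import bz2
import zlib

USTAR_OFFSET = 257  # offset of the "ustar" magic in the first tar header block


def read_uncompressed_prefix(path, want=1024, chunk_size=8192):
    """Return up to `want` uncompressed bytes from a gzip or bzip2 file.

    Hypothetical helper. Returns None if the file starts with neither
    the gzip nor the bzip2 magic number.
    """
    with open(path, "rb") as f:
        chunk = f.read(chunk_size)
        if chunk.startswith(b"\x1f\x8b"):
            # 16 + MAX_WBITS makes zlib expect a gzip header and trailer
            z = zlib.decompressobj(16 + zlib.MAX_WBITS)
        elif chunk.startswith(b"BZh"):
            z = bz2.BZ2Decompressor()
        else:
            return None

        out = b""
        # Feed compressed chunks until enough output exists, the
        # compressed stream ends, or the file runs out.
        while chunk and len(out) < want and not z.eof:
            out += z.decompress(chunk)
            chunk = f.read(chunk_size)
        return out[:want]


def looks_like_tar(prefix):
    """Heuristic only: 'ustar' at offset 257, or a prefix of 1024 zero bytes."""
    if prefix is None:
        return False
    if prefix[USTAR_OFFSET:USTAR_OFFSET + 5] == b"ustar":
        return True
    # Approximation of the empty-archive case: only the first 1024
    # bytes are inspected, not the whole uncompressed stream.
    return prefix == b"\0" * 1024
```

A call such as looks_like_tar(read_uncompressed_prefix(file_name)) then gives a yes/no guess without decompressing the whole file. Very old tar archives that predate the ustar header would not be recognized by this check, which is why the result is only "very likely".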