pythonpython-3.xgziptar

How can you verify two tar.gz files are identical


I am making a sharing protocol, and when you share a folder it gets tar.gz-ipped and inserted in a folder.

It's created like this:

with tarfile.open(full_data_name, "w:gz", format=GNU_FORMAT) as tar_handle:
   ...
   tar_handle.add(file_path)

When you do that again, I'd like to verify and check if new tar.gz is identical to the old one (so I do not need to re-publish it).

I know about pkgdiff and that works fine, but I'd like to do it in python.

I also know I can do it manually, de-zip&tar the files, load up the content and verify byte wise, but isn't there some simpler and less resource hungry method?

I have tried to just check the contents of the tar.gz files (removing the timestamp at byte 4-7) but that only works sometimes, so I guess there is some random reshuffling in the tar part or some randomness in the gz, as pkgdiff says they are the same, but a hex editor shows lots of differences.


Solution

  • You can extract the files in memory and compute checksums. This avoids extracting the entire archive physically.

    import tarfile
    import hashlib
    
    def get_tar_checksum(tar_path):
        with tarfile.open(tar_path, "r:gz") as tar:
            checksums = {}
            for member in tar.getmembers():
                if member.isfile():
                    file_data = tar.extractfile(member).read()
                    checksums[member.name] = hashlib.sha256(file_data).hexdigest()
        return checksums
    
    
    def tar_equal(tar_path1, tar_path2):
        return get_tar_checksum(tar_path1) == get_tar_checksum(tar_path2)
    
    # Usage
    tar1 = "path/to/first.tar.gz"
    tar2 = "path/to/second.tar.gz"
    
    if tar_equal(tar1, tar2):
        print("The tar.gz files are identical.")
    else:
        print("The tar.gz files are different.")
    

    The tar_equal(tar_path1, tar_path2) method returns True if the compared tars are identical. You can implement your logic from there.