I am building a sharing protocol, and when you share a folder it gets packed into a tar.gz archive and placed in a folder.
It's created like this:
import tarfile

with tarfile.open(full_data_name, "w:gz", format=tarfile.GNU_FORMAT) as tar_handle:
    ...
    tar_handle.add(file_path)
When that happens again, I'd like to check whether the new tar.gz is identical to the old one (so I do not need to re-publish it).
I know about pkgdiff, and that works fine, but I'd like to do it in Python.
I also know I can do it manually: decompress and untar the archives, load the contents and compare them byte-wise, but isn't there a simpler and less resource-hungry method?
I have tried just comparing the bytes of the tar.gz files (after removing the timestamp at bytes 4-7), but that only works sometimes, so I guess there is some random reshuffling in the tar part or some randomness in the gzip stream: pkgdiff says the archives are the same, but a hex editor shows lots of differences.
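The differences you are seeing most likely do not come from reshuffling: besides the gzip MTIME field at bytes 4-7, the gzip header can embed the original filename, the deflate output is not guaranteed to be byte-identical across zlib versions, and each tar member header stores the file's modification time, so re-created input files change bytes throughout the archive. pkgdiff compares the contained files' contents, which is why it still reports the archives as equal. If the tar bytes are stable in your case, one cheap check is to compare the decompressed streams and skip the gzip header entirely; a minimal sketch (gz_payload_equal and the chunk size are my own choices, not from your code):

import gzip

def gz_payload_equal(path1, path2, chunk_size=1 << 16):
    # Compare the decompressed tar payloads chunk by chunk; the gzip
    # header (timestamp, embedded filename) is never looked at.
    with gzip.open(path1, "rb") as f1, gzip.open(path2, "rb") as f2:
        while True:
            block1 = f1.read(chunk_size)
            block2 = f2.read(chunk_size)
            if block1 != block2:
                return False
            if not block1:  # both streams exhausted at the same point
                return True

Note that this is strict about the tar payload: if the files were re-added with fresh modification times, the tar headers differ and this returns False even though the contents are equal.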
If that is your situation, you can instead read each member in memory and compute per-file checksums; this ignores the tar metadata entirely and avoids extracting the archive to disk.
import tarfile
import hashlib

def get_tar_checksum(tar_path):
    # Map every regular file in the archive to the SHA-256 of its content;
    # tar metadata (mtimes, permissions) is deliberately ignored.
    with tarfile.open(tar_path, "r:gz") as tar:
        checksums = {}
        for member in tar.getmembers():
            if member.isfile():
                file_data = tar.extractfile(member).read()
                checksums[member.name] = hashlib.sha256(file_data).hexdigest()
        return checksums

def tar_equal(tar_path1, tar_path2):
    # Equal when both archives hold the same file names with the same content.
    return get_tar_checksum(tar_path1) == get_tar_checksum(tar_path2)
# Usage
tar1 = "path/to/first.tar.gz"
tar2 = "path/to/second.tar.gz"

if tar_equal(tar1, tar2):
    print("The tar.gz files are identical.")
else:
    print("The tar.gz files are different.")
Either way, tar_equal(tar_path1, tar_path2) returns True when the two archives contain the same files with identical contents, so you can implement your re-publish logic from there.
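If you also control the creation side, a complementary option is to make the archives byte-reproducible in the first place, so that comparing them is just hashing the two .tar.gz files. Python's gzip module accepts an mtime argument that pins the header timestamp; a sketch under the assumption that the input files themselves (contents and metadata) are unchanged between runs:

import gzip
import tarfile

def create_reproducible(full_data_name, file_path):
    # Write the gzip container ourselves so MTIME can be pinned to 0;
    # re-running this on unchanged input then yields byte-identical
    # output (given the same Python/zlib versions).
    with gzip.GzipFile(full_data_name, "wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar_handle:
            tar_handle.add(file_path)

create_reproducible is a hypothetical wrapper around your existing creation code; the tar member headers still record each file's mtime, so this only helps when the source files are not re-created between publishes.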