As explained in this article https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49, the md5 of two .tar.gz files that compress the exact same set of files can differ. This is because gzip, for example, includes a timestamp in the header of the compressed file.
The article proposes 3 solutions, and I would ideally like to use the first one, which is:
We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;
And this solution works well:
tar -c ./bin |gzip -n >one.tar.gz
tar -c ./bin |gzip -n >two.tar.gz
md5sum one.tar.gz two.tar.gz
Nevertheless, I have no idea what a good way to do this in Python would be. Is there a way to do it with tarfile (https://docs.python.org/2/library/tarfile.html)?
As a workaround, you can use bzip2 compression instead. It does not seem to have this problem:
import tarfile
tar1 = tarfile.open("one.tar.bz2", "w:bz2")
tar1.add("bin")
tar1.close()
tar2 = tarfile.open("two.tar.bz2", "w:bz2")
tar2.add("bin")
tar2.close()
Running md5sum gives:
martin@martin-UX305UA:~/test$ md5sum one.tar.bz2 two.tar.bz2
e9ec2fd4fbdfae465d43b2f5ecaecd2f one.tar.bz2
e9ec2fd4fbdfae465d43b2f5ecaecd2f two.tar.bz2
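If you do need gzip rather than bzip2, one possible approach is to build the gzip stream yourself and hand it to tarfile, which roughly mirrors what gzip -n does. The following is only a sketch: make_tar_gz and md5_of are hypothetical helper names, not part of any library. It relies on gzip.GzipFile accepting mtime=0 (fixed timestamp) and filename="" (no file name stored in the header). Note that the tar headers still record each member's own modification time, so the hashes should only match as long as the input files themselves are unchanged.

```python
import gzip
import hashlib
import tarfile


def make_tar_gz(source_dir, out_path):
    """Write source_dir to out_path as a .tar.gz with a fixed gzip header.

    Hypothetical helper: mtime=0 pins the gzip timestamp and filename=""
    keeps the output file name out of the gzip header.
    """
    with open(out_path, "wb") as raw:
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            # Plain "w" mode: tarfile writes an uncompressed tar stream
            # into the GzipFile, which does the compression.
            with tarfile.open(fileobj=gz, mode="w") as tar:
                tar.add(source_dir)


def md5_of(path):
    """Return the hex md5 digest of the file at path (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling make_tar_gz("bin", "one.tar.gz") and make_tar_gz("bin", "two.tar.gz") and then comparing md5_of("one.tar.gz") with md5_of("two.tar.gz") should give the same digest, assuming nothing under bin changed between the two calls.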