python-2.7 gzip tar md5sum

How to create an archive that keeps the same MD5 hash for identical content in Python?


As explained in this article (https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49), the MD5 hashes of two .tar.gz files built from the exact same set of files can differ. This is because gzip stores, for example, a timestamp in the header of the compressed file.
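To make the header issue concrete, here is a minimal sketch that compresses the same bytes twice with different timestamps and reads back the 4-byte MTIME field stored at offset 4 of the gzip header. The helper names `compress` and `gzip_header_mtime` are made up for this illustration, and the `mtime` values 1 and 2 are arbitrary.

```python
import gzip
import io
import struct


def gzip_header_mtime(data):
    """Return the MTIME field (bytes 4-7, little-endian) of a gzip stream."""
    return struct.unpack("<I", data[4:8])[0]


def compress(payload, mtime):
    """Gzip `payload` in memory with an explicit header timestamp."""
    buf = io.BytesIO()
    gz = gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime)
    gz.write(payload)
    gz.close()
    return buf.getvalue()


a = compress(b"same content", mtime=1)
b = compress(b"same content", mtime=2)

print(a == b)            # False
print(a[8:] == b[8:])    # True
```

Only the timestamp bytes differ between the two streams, which is enough to change the MD5 of the whole file.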

The article proposes three solutions, and ideally I would like to use the first one:

We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;

And this solution works well:

tar -c ./bin |gzip -n >one.tar.gz
tar -c ./bin |gzip -n >two.tar.gz
md5sum one.tar.gz two.tar.gz

However, I have no idea what a good way to do this in Python would be. Is there a way to do it with tarfile (https://docs.python.org/2/library/tarfile.html)?


Solution

  • As a workaround, you can use bzip2 compression instead. It does not seem to have this problem:

    import tarfile
    
    tar1 = tarfile.open("one.tar.bz2", "w:bz2")
    tar1.add("bin")
    tar1.close()
    
    tar2 = tarfile.open("two.tar.bz2", "w:bz2")
    tar2.add("bin")
    tar2.close()
    

    Running md5sum on both archives gives:

    martin@martin-UX305UA:~/test$ md5sum one.tar.bz2 two.tar.bz2 
    e9ec2fd4fbdfae465d43b2f5ecaecd2f  one.tar.bz2
    e9ec2fd4fbdfae465d43b2f5ecaecd2f  two.tar.bz2
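If gzip output is required, one possible approach (a sketch, not tested against every Python version) is to mimic `gzip -n` by wrapping the tar stream in `gzip.GzipFile` with `mtime=0` and an empty `filename`. The `mtime` argument is available from Python 2.7 on. The helper names `make_tar_gz` and `md5_of` are made up for this example. As with the shell pipeline, the tar stream itself still records each file's own modification time, so the source files must be unchanged between runs.

```python
import gzip
import hashlib
import os
import tarfile


def make_tar_gz(source_dir, out_path):
    """Write source_dir to out_path as a .tar.gz without gzip timestamp or name."""
    arcname = os.path.basename(os.path.normpath(source_dir))
    with open(out_path, "wb") as raw:
        # filename="" keeps the file name out of the gzip header;
        # mtime=0 pins the header timestamp, similar to `gzip -n`.
        gz = gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0)
        try:
            tar = tarfile.open(fileobj=gz, mode="w")
            try:
                tar.add(source_dir, arcname=arcname)
            finally:
                tar.close()
        finally:
            gz.close()


def md5_of(path):
    """Return the hex MD5 digest of the file at path."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```

Calling `make_tar_gz("bin", "one.tar.gz")` and `make_tar_gz("bin", "two.tar.gz")` and comparing `md5_of` on both files should then give the same digest, provided the contents and metadata of `bin` did not change in between.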