
What is the fundamental difference between tar (Unix) and tarfile (Python)?


What is the fundamental difference between tarring a folder using tar on Unix and tarfile in Python that results in a different file size?

In the example below, there is an 8.2 MB difference. I'm currently using a Mac. The folder in this example contains a bunch of random text files for testing purposes.

tar -cvf archive_unix.tar files/

python -m tarfile -c archive_pycli.tar files/ # using Python 3.9.6

-rw-r--r--  1 userid  staff  24606720 Oct 15 09:40 archive_pycli.tar
-rw-r--r--  1 userid  staff  16397824 Oct 15 09:39 archive_unix.tar

Solution

  • Interesting question. The tarfile documentation (https://docs.python.org/3/library/tarfile.html) states that, since Python 3.8, the default format for tar archives created by tarfile is PAX_FORMAT, whereas the archives created by the tar command here are in the GNU format. I believe that explains the difference.

    To produce one archive in the same format as the tar command (GNU) and one in the tarfile default format (as your command did):

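    import tarfile
    # GNU format: the same format as the archive written by the tar command
    with tarfile.TarFile(name='archive-py-gnu.tar', mode='w', format=tarfile.GNU_FORMAT) as tf:
        tf.add('tmp')
    # Default format: PAX_FORMAT since Python 3.8
    with tarfile.TarFile(name='archive-py-default.tar', mode='w') as tf:
        tf.add('tmp')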
    

    For comparison:

    $ tar cf archive-tar.tar tmp/
    $ ls -l 
    3430400 16:28 archive-py-default.tar
    3317760 16:28 archive-py-gnu.tar
    3317760 16:27 archive-tar.tar
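
    One likely source of the extra bytes, as far as I can tell from CPython's behavior: when a member's mtime is a float (which is what tf.add() copies from os.stat()), PAX_FORMAT writes an extra extended header in front of that member, while GNU_FORMAT just stores the whole seconds. The sketch below is only an illustration of that assumption; archive_size, the member names and the timestamps are made up for the example, and everything is built in memory rather than from your folder.

```python
import io
import tarfile


def archive_size(fmt, mtime, count=50):
    """Return the size in bytes of an in-memory archive holding `count` tiny members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        for i in range(count):
            data = b"hello\n"
            info = tarfile.TarInfo(name=f"files/file_{i:03d}.txt")  # hypothetical names
            info.size = len(data)
            info.mtime = mtime
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Fractional mtime, like the float that os.stat() reports.
gnu_float = archive_size(tarfile.GNU_FORMAT, 1634283600.25)
pax_float = archive_size(tarfile.PAX_FORMAT, 1634283600.25)

# Whole-second mtime, which fits in the classic header field.
gnu_int = archive_size(tarfile.GNU_FORMAT, 1634283600)
pax_int = archive_size(tarfile.PAX_FORMAT, 1634283600)

print("float mtime: GNU =", gnu_float, "PAX =", pax_float)
print("int mtime:   GNU =", gnu_int, "PAX =", pax_int)
```

    If the assumption holds, the PAX archive should come out larger only in the float case, and the overhead should grow with the number of files rather than with their size, which would fit a folder full of small text files.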
    

    Results of the file command:

    $ file archive_unix.tar
    archive_unix.tar: POSIX tar archive (GNU)
    $ file archive-py-gnu.tar
    archive-py-gnu.tar: POSIX tar archive (GNU)
    $ file archive-py-default.tar
    archive-py-default.tar: POSIX tar archive
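
    The same check can be roughly reproduced from Python. This is a sketch under the assumption that looking at the magic bytes of the first header (offset 257) is close to what file does; describe_archive and build are hypothetical helpers, and the archives are created in memory so the snippet runs on its own. It also counts how many members carry a pax extended header, via TarInfo.pax_headers.

```python
import io
import tarfile


def describe_archive(fileobj):
    """Return (magic_kind, members_with_pax_headers, total_members) for a tar stream."""
    # The magic and version fields occupy 8 bytes starting at offset 257 of a header.
    fileobj.seek(257)
    magic = fileobj.read(8)
    if magic == b"ustar  \x00":
        kind = "GNU"
    elif magic == b"ustar\x0000":
        kind = "POSIX"
    else:
        kind = "unknown"
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj, mode="r:") as tf:
        members = tf.getmembers()
    # pax_headers is an empty dict for members without an extended header.
    with_pax = sum(1 for m in members if m.pax_headers)
    return kind, with_pax, len(members)


def build(fmt):
    """Build a one-member archive in memory (stand-in for a real file on disk)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        data = b"example\n"
        info = tarfile.TarInfo(name="tmp/a.txt")  # hypothetical member
        info.size = len(data)
        info.mtime = 1634283600.25  # float, as tf.add() would record it
        tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


gnu_result = describe_archive(build(tarfile.GNU_FORMAT))
pax_result = describe_archive(build(tarfile.PAX_FORMAT))
print("GNU_FORMAT:", gnu_result)
print("PAX_FORMAT:", pax_result)
```

    For a real archive, open it with open('archive-py-default.tar', 'rb') and pass the file object to describe_archive instead of the in-memory buffer.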
    

    I can't tell you in detail how the formats differ, sorry, but I hope this helps.