What is the fundamental difference between tarring a folder using tar on Unix and tarfile in Python that results in a different file size?
In the example below, there is an 8.2 MB difference. I'm currently using a Mac. The folder in this example contains a bunch of random text files for testing purposes.
tar -cvf archive_unix.tar files/
python -m tarfile -c archive_pycli.tar files/ # using Python 3.9.6
-rw-r--r-- 1 userid staff 24606720 Oct 15 09:40 archive_pycli.tar
-rw-r--r-- 1 userid staff 16397824 Oct 15 09:39 archive_unix.tar
Interesting question. The documentation of tarfile (https://docs.python.org/3/library/tarfile.html) mentions that, since Python 3.8, the default format for tar archives created by tarfile is PAX_FORMAT, whereas archives created by the tar command have the GNU format, which I believe explains the difference.
Now to produce the same archive as the tar command and one with the default format (as your command did):
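import tarfile

# Same format as the tar command used below (GNU)
with tarfile.TarFile(name='archive-py-gnu.tar', mode='w', format=tarfile.GNU_FORMAT) as tf:
    tf.add('tmp')

# tarfile's default format (PAX_FORMAT since Python 3.8)
with tarfile.TarFile(name='archive-py-default.tar', mode='w') as tf:
    tf.add('tmp')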
For comparison:
$ tar cf archive-tar.tar tmp/
$ ls -l
3430400 16:28 archive-py-default.tar
3317760 16:28 archive-py-gnu.tar
3317760 16:27 archive-tar.tar
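To get a feel for where the extra bytes could come from, here is a minimal sketch that writes the same members in memory under both formats and compares the resulting sizes. The helper name archive_size, the member names, and the fixed timestamp are made up for illustration. It rests on an assumption worth checking against your Python version: with PAX_FORMAT, tarfile appears to write an extended header for every member whose mtime is a float (which is what os.stat() reports), and each such header costs additional 512-byte blocks.

```python
import io
import tarfile


def archive_size(fmt, n_files=50, payload=b"hello\n"):
    """Build an in-memory tar with n_files small members; return its size in bytes.

    archive_size is a hypothetical helper, not part of tarfile.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        for i in range(n_files):
            info = tarfile.TarInfo(name=f"files/file_{i}.txt")
            info.size = len(payload)
            # Float timestamp, mimicking what os.stat() yields for real files
            # (assumption: this is what triggers the pax extended header).
            info.mtime = 1700000000.5
            tf.addfile(info, io.BytesIO(payload))
    return len(buf.getvalue())


gnu_size = archive_size(tarfile.GNU_FORMAT)
pax_size = archive_size(tarfile.PAX_FORMAT)
print("GNU:", gnu_size, "bytes")
print("PAX:", pax_size, "bytes")
print("difference:", pax_size - gnu_size, "bytes")
```

If that assumption holds, the overhead grows with the number of files rather than with their contents, which would fit a folder full of many small text files.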
Results of the file command:
$ file archive_unix.tar
archive_unix.tar: POSIX tar archive (GNU)
$ file archive-py-gnu.tar
archive-py-gnu.tar: POSIX tar archive (GNU)
$ file archive-py-default.tar
archive-py-default.tar: POSIX tar archive
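Besides the file command, tarfile itself can show whether members carry pax extended headers, through the documented TarInfo.pax_headers attribute. The sketch below is only illustrative: pax_header_keys and build_sample are hypothetical helpers, and the sample member with a float timestamp is an assumption chosen to mimic what tf.add() records for a real file.

```python
import io
import tarfile


def pax_header_keys(fileobj):
    """Map each member name to the sorted pax extended-header keys stored for it."""
    with tarfile.open(fileobj=fileobj, mode="r") as tf:
        return {m.name: sorted(m.pax_headers) for m in tf.getmembers()}


def build_sample(fmt):
    """Write a one-member archive in the given format and return it as a file object."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo(name="files/a.txt")
        info.size = 3
        info.mtime = 1700000000.5  # float timestamp, as os.stat() would report
        tf.addfile(info, io.BytesIO(b"abc"))
    buf.seek(0)
    return buf


print("PAX:", pax_header_keys(build_sample(tarfile.PAX_FORMAT)))
print("GNU:", pax_header_keys(build_sample(tarfile.GNU_FORMAT)))
```

For an archive on disk, the same loop works with tarfile.open("archive_pycli.tar") in place of the in-memory file object.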
I can't tell you in detail how the formats differ, sorry, but I hope this helps.