python tarfile

Why does it sometimes look like Python tarfile is extracting a directory with its contents?


I am trying to work with tar files using Python in an application running in a Docker container. Since I don't use tarfile regularly, I drafted some code and ran it in my local Python environment so I could verify that it worked the way I want before I run it in the container.

For simplicity, say I have a directory a/ that contains a file test_file.py, and I tarred up a/ recursively into file.tar.gz. When I list the contents of the tar file, I see both the directory and the file (a/ and a/test_file.py). When I extract the directory in my local environment, I get the directory and its contents. When I run the same code in my container, I get only the empty directory with no files. After hitting this problem, I did some searching and found posts like "Extract only a single directory from tar (in python)" that recommend explicitly listing every file you want to extract rather than assuming it will come along with its parent directory, so I can do that. But...

Update: What I thought was happening was not actually happening, as the answer explains. There is no strange or mysterious behavior, just flawed logic in my test code because I forgot to delete the original directory structure before untarring my tar file.

What's bothering me is this inconsistent behavior between my two environments. My local environment is using Python 3.9.15 on Ubuntu 22.04. My Docker container is using Python 3.10.9 on Ubuntu 18.04. So the environments are different, but the discrepancy still seems odd.

Here's some sample code. I know I mentioned using the extract method earlier. I tried using that first, then switched to extractall with the members kwarg later, and I get the same behavior -- the file comes too in my local environment but not in the Docker container.

import logging
import os
from pathlib import Path
import tarfile
import tempfile


def main():
    with tempfile.TemporaryDirectory() as temp_dir:
        # make a tar file with one directory that contains one file
        file_path = Path(temp_dir) / "file.tar.gz"
        os.chdir(temp_dir)
        logging.info("Changed dir to %s", os.getcwd())
        subdir = "a"
        os.mkdir(subdir)
        with open(Path(subdir) / "test_file.py", "w") as file_obj:
            file_obj.write("import this\n")
        with tarfile.open(file_path, "w:gz") as tar:
            tar.add(subdir, recursive=True)

        # unpack tar file
        with tarfile.open(file_path) as tar:
            file_names = tar.getnames()
            logging.info(file_names)
            members = [tar.getmember(subdir)]
            tar.extractall(path=temp_dir, members=members)

        # check contents of extracted directory
        dir_path = Path(temp_dir) / subdir
        output = os.listdir(dir_path)
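        logging.info("%s contents: %s", dir_path, output)


# Entry point so the sample runs as a script; INFO level makes the log calls visible.
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()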

Logging shows that the directory and the file are in the tar file (from call to tar.getnames()) in both environments, but the output of the last listdir call is the file in one environment and an empty list in the other.
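One way to see that the archive stores the directory and the file as two separate entries is to list each member together with its type. The following is only a minimal sketch: the helper names describe_members and build_sample are hypothetical, and it builds a throwaway archive with the same layout as the one described above (using arcname instead of os.chdir).

```python
import tarfile
import tempfile
from pathlib import Path


def describe_members(tar_path):
    """Return (name, kind) pairs for every entry stored in the archive."""
    pairs = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if member.isdir():
                kind = "dir"
            elif member.isfile():
                kind = "file"
            else:
                kind = "other"
            pairs.append((member.name, kind))
    return pairs


def build_sample(work_dir):
    """Create a/test_file.py under work_dir and tar up a/ recursively."""
    work_dir = Path(work_dir)
    (work_dir / "a").mkdir()
    (work_dir / "a" / "test_file.py").write_text("import this\n")
    tar_path = work_dir / "file.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps the stored names relative to the archive root
        tar.add(work_dir / "a", arcname="a", recursive=True)
    return tar_path


with tempfile.TemporaryDirectory() as temp_dir:
    entries = describe_members(build_sample(temp_dir))
    print(entries)
```

Each entry is its own TarInfo, so asking for the directory entry alone does not imply asking for the entries whose names start with it.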


Solution

  • There are two problems. I assume that in the environment where you got only an empty subdirectory, you ran only the unpacking part.

    1. You never remove the original files, and you extract into the same directory. Thus, after unpacking, you cannot tell whether the files on disk came from the archive or are the originals. Calling shutil.rmtree(subdir) (with import shutil) right after you create the tarball should solve this.

    2. Once you solve the first problem, you will see that extraction only ever creates the empty directory. This is because you explicitly request it: your members is only ["a"] (or rather the TarInfo version of it), and thus only "a" is extracted, just as the post you linked warned. Removing the members=members parameter, or using members = [tar.getmember("a"), tar.getmember("a/test_file.py")], will get you the desired result. Even [tar.getmember("a/test_file.py")] will be fine: the directory will be created for you in this case even though it is not listed for extraction.
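Both points can be checked with a small experiment. The following is a minimal sketch rather than the asker's code: the helper extract_and_list and the case labels are hypothetical names, the originals are deleted with shutil.rmtree before unpacking, and each case is extracted into its own fresh directory so that leftovers cannot mask the result.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_and_list(tar_path, dest, names=None):
    """Extract the named members (everything if names is None) into dest
    and return the sorted relative paths that exist there afterwards."""
    dest = Path(dest)
    with tarfile.open(tar_path) as tar:
        members = None if names is None else [tar.getmember(n) for n in names]
        tar.extractall(path=dest, members=members)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*"))


with tempfile.TemporaryDirectory() as temp_dir:
    root = Path(temp_dir)
    src = root / "src"
    (src / "a").mkdir(parents=True)
    (src / "a" / "test_file.py").write_text("import this\n")
    tar_path = root / "file.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src / "a", arcname="a", recursive=True)

    # Remove the originals so anything found later must come from the archive.
    shutil.rmtree(src)

    cases = [
        ("dir_only", ["a"]),                        # what the question's code requests
        ("dir_and_file", ["a", "a/test_file.py"]),  # both members listed explicitly
        ("file_only", ["a/test_file.py"]),          # parent directory is created implicitly
        ("everything", None),                       # no members argument at all
    ]
    results = {}
    for label, names in cases:
        dest = root / label
        dest.mkdir()
        results[label] = extract_and_list(tar_path, dest, names)
        print(label, results[label])
```

Only the dir_only case should end up with an empty a/ directory; the other three should contain a/test_file.py as well.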