python extract tarfile

TarFile.extractall base path wrong, python?


I have the following code which downloads and extracts a dataset into a directory. The problem is the line

housing_tarball.extractall(path='datasets') 

If I set it to path='datasets', it extracts into: datasets/housing.

If I set it to path='datasets/housing/', it extracts into: datasets/housing/housing.

So it automatically adds the housing directory and puts the files in there without my specifying it in the path. Does it take the path from where the tar file lies: tarball_path = Path("datasets/housing/housing.tgz")?

Here is the complete code:

from pathlib import Path
import pandas as pd
import tarfile
import urllib.request
import os

def load_housing_data():
    tarball_path = Path("datasets/housing/housing.tgz")

    if not tarball_path.is_file():
        Path("datasets/housing").mkdir(parents=True, exist_ok=True) # create the directory if it does not exist
        url = "https://github.com/ageron/data/raw/main/housing.tgz"

        with urllib.request.urlopen(url) as response, tarball_path.open(mode="wb") as tarball_file: # open the file for writing in binary mode
            tarball_file.write(response.read()) # write the response to the file

    with tarfile.open(tarball_path) as housing_tarball: # open the tarball
        housing_tarball.extractall(path='datasets') # extract all the files to the datasets directory

    # remove the tarball file
    os.remove(tarball_path)

    return pd.read_csv("datasets/housing/housing.csv")

Solution

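  • Yes, extractall automatically creates the housing/ directory, but not because of where the tarball is stored.

    If you think about it, this isn't really surprising. A tarball stores each file under a path relative to the archive root, and extractall(path=...) recreates those stored paths beneath the directory you pass in. Tarballs are usually made from folders, so they extract as folders. I assume that whoever created the tarball you're using archived a folder named housing rather than a single file, so the CSV is stored as housing/housing.csv. With path='datasets' you therefore get datasets/housing/housing.csv, and with path='datasets/housing/' you get datasets/housing/housing/housing.csv.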
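To check this for any tarball, you can list the member names before extracting. The sketch below is only an illustration, not the code from the question: it builds a tiny stand-in archive with dummy CSV contents, whose single file is stored as housing/housing.csv, and then extracts it with two different path values. The helper names build_demo_tarball, extract_and_list, and demo are made up for this example.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def build_demo_tarball(tarball_path):
    """Create a small .tgz whose only member is stored as housing/housing.csv.

    This is a stand-in for the real dataset; the payload is dummy data.
    """
    payload = b"longitude,latitude\n-122.23,37.88\n"
    info = tarfile.TarInfo(name="housing/housing.csv")
    info.size = len(payload)
    with tarfile.open(tarball_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def extract_and_list(tarball_path, base, dest_rel):
    """Extract into base/dest_rel and return all file paths relative to base."""
    dest = base / dest_rel
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball_path) as tar:
        tar.extractall(path=dest)
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # The tarball deliberately lives outside the extraction directories,
        # to show that its own location plays no role.
        tarball = root / "downloads" / "housing.tgz"
        tarball.parent.mkdir()
        build_demo_tarball(tarball)

        # Member names are the paths stored inside the archive.
        with tarfile.open(tarball) as tar:
            names = tar.getnames()

        # Same archive, two different extraction targets.
        shallow = extract_and_list(tarball, root / "run1", "datasets")
        deep = extract_and_list(tarball, root / "run2", "datasets/housing")
    return names, shallow, deep


if __name__ == "__main__":
    names, shallow, deep = demo()
    print(names)    # ['housing/housing.csv']
    print(shallow)  # ['datasets/housing/housing.csv']
    print(deep)     # ['datasets/housing/housing/housing.csv']
```

In both runs the stored member path housing/housing.csv is appended to whatever path you pass, which is why path='datasets' is the value that matches the pd.read_csv("datasets/housing/housing.csv") call in your function.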