I have the following code which downloads and extracts a dataset into a directory. The problem is the line
housing_tarball.extractall(path='datasets')
If I set it to path='datasets', it extracts into datasets/housing.
If I set it to path='datasets/housing/', it extracts into datasets/housing/housing.
So it automatically adds a housing directory and puts the files in there, even though I did not specify it in the path. Does it take the path from where the tar file lies: tarball_path = Path("datasets/housing.tgz")?
Here is the complete code:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request
import os

def load_housing_data():
    tarball_path = Path("datasets/housing/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets/housing").mkdir(parents=True, exist_ok=True)  # create the directory if it does not exist
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        with urllib.request.urlopen(url) as response, tarball_path.open(mode="wb") as tarball_file:  # open the file for writing in binary mode
            tarball_file.write(response.read())  # write the response to the file
        with tarfile.open(tarball_path) as housing_tarball:  # open the tarball
            housing_tarball.extractall(path='datasets')  # extract all the files to the datasets directory
        # remove the tarball file
        os.remove(tarball_path)
    return pd.read_csv("datasets/housing/housing.csv")
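Yes, it does automatically create a housing/ directory, and no, the location of the .tgz file has nothing to do with it.
If you think about it, this isn't really surprising. A tarball records a path for each of its members, and extractall(path=...) recreates those paths underneath the path you pass. Tarballs are usually archived folders, therefore they extract as folders. I assume that whoever created the tarball you're using tarred a folder named housing as opposed to a single file, so housing.csv is stored as housing/housing.csv and ends up in <path>/housing/.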
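To check this without downloading anything, here is a minimal sketch that builds a small archive locally and extracts it into two different targets. The helper names (build_demo_tarball, member_names, extract_and_list), the file name demo.tgz, and the CSV contents are made up for illustration; the only assumption carried over from the question is that the real archive stores its file under a housing/ folder.

```python
import tarfile
import tempfile
from pathlib import Path

# Use the "data" extraction filter where this Python version provides it
# (it avoids a DeprecationWarning on recent versions); otherwise pass nothing.
EXTRACT_KWARGS = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}


def build_demo_tarball(workdir: Path) -> Path:
    """Create a hypothetical archive whose file is stored as housing/housing.csv."""
    source_dir = workdir / "src" / "housing"
    source_dir.mkdir(parents=True)
    (source_dir / "housing.csv").write_text("longitude,latitude\n-122.23,37.88\n")

    # The archive deliberately lives in an unrelated directory.
    tarball_path = workdir / "anywhere" / "demo.tgz"
    tarball_path.parent.mkdir(parents=True)
    with tarfile.open(tarball_path, "w:gz") as tarball:
        # arcname is the path recorded inside the archive.
        tarball.add(source_dir, arcname="housing")
    return tarball_path


def member_names(tarball_path: Path) -> list[str]:
    """Return the member paths stored in the archive."""
    with tarfile.open(tarball_path) as tarball:
        return tarball.getnames()


def extract_and_list(tarball_path: Path, target: Path) -> list[str]:
    """Extract into target and list the resulting files relative to target."""
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball_path) as tarball:
        tarball.extractall(path=target, **EXTRACT_KWARGS)
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        tarball_path = build_demo_tarball(workdir)
        print(member_names(tarball_path))
        # ['housing', 'housing/housing.csv']
        print(extract_and_list(tarball_path, workdir / "a" / "datasets"))
        # ['housing/housing.csv']  i.e. datasets/housing/housing.csv
        print(extract_and_list(tarball_path, workdir / "b" / "datasets" / "housing"))
        # ['housing/housing.csv']  i.e. datasets/housing/housing/housing.csv
```

Calling getnames() on your own housing.tgz before extracting should tell you whether it really contains a housing/ prefix. If it does, path='datasets' is the value that gives you datasets/housing/housing.csv.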