I need to create a GitHub repository where I can organize Jupyter notebooks by topic as tutorials. Some notebooks will require loading large(r) data files, which I don't want to be part of the repository itself.
My idea is to provide all data files via a separate online resource and download the required files in the notebooks using some custom auxiliary method in a utils.py script.
Since I want to use different subfolders for organizing the notebooks, utils.py would need to reside in a parent folder. However, loading .py files from a parent folder within a notebook seems to require manually tweaking the module search path (sys.path) in the notebook.
I guess an alternative would be to put utils.py (and other shared code) into its own package that needs to be installed before using the notebooks. That kind of feels like overkill, though.
Is there some other, better alternative to handle this?
Use a single utils.py file and structure the notebooks into topic-based folders.
Christian_tutorials/
│
├── utils.py
│
├── data/
└── notebooks/
    ├── topic1/
    │   ├── notebook1.ipynb
    │   └── notebook2.ipynb
    └── topic2/
        └── notebook3.ipynb
Assumptions
utils.py - assumed to contain the shared helper functions.
data/ - assumed local folder for storing the downloaded data files.
Accessing the utils
# From a notebook inside notebooks/topic1/, the repository root
# (where utils.py lives) is two levels up:
import sys
sys.path.append('../..')

from utils import get_data

# Now you can use it to get data:
data_file = get_data(
    filename="example.csv",
    url="https://storage_url/example.csv"
)

import pandas as pd
df = pd.read_csv(data_file)
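If you'd rather not reason about relative dots, a small sketch (assuming each notebook's working directory is its own topic folder, two levels below the repository root) resolves the root with pathlib and appends it only once:

```python
import sys
from pathlib import Path

# Assumption: the notebook runs from notebooks/<topic>/, so the
# repository root is two directory levels above the working directory.
repo_root = (Path.cwd() / ".." / "..").resolve()

# Avoid appending duplicates when the cell is re-run.
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))
```

Resolving to an absolute path also keeps imports working if a later cell changes the working directory.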
utils.py
import requests
from pathlib import Path

def get_data(filename, url):
    """Download `url` into the data/ folder unless the file is already cached."""
    data_dir = Path("data")
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / filename
    if not file_path.exists():
        print(f"Downloading {filename}...")
        response = requests.get(url)
        response.raise_for_status()
        with open(file_path, 'wb') as f:
            f.write(response.content)
        print("Download complete!")
    return file_path
The get_data function streamlines data handling by creating a data/ folder if it doesn't exist, checking for the requested file, and downloading it only if necessary.
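Since the question is specifically about large(r) files, note that response.content loads the whole download into memory at once. A possible variant (the name get_data_streamed and the chunk_size default are my own choices, not part of the original) streams the response to disk in chunks instead:

```python
import requests
from pathlib import Path

def get_data_streamed(filename, url, chunk_size=8192):
    """Like get_data, but writes the download in chunks to limit memory use."""
    data_dir = Path("data")
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / filename
    if not file_path.exists():
        print(f"Downloading {filename}...")
        # stream=True defers the body; iter_content yields it piecewise.
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            with open(file_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    f.write(chunk)
        print("Download complete!")
    return file_path
```

The call signature matches get_data, so notebooks can switch between the two without other changes; cached files are still returned without touching the network.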