python, libraries, python-packaging

Structuring project folders with many datasets and models


I have a folder with different datasets, on which I'm training different models that I keep in a single py file, models.py, as shown here:

my_project
├───data
│   ├───data1.py
│   ├───data2.py
│   └───data3.py
├───models.py
├───utils.py
└───estimator.py

estimator.py contains the method I'm testing, which estimates some property of these models. To do that, I need to train each model on each dataset, so that the property is estimated on a well-defined, trained model.

However, defining a model means going through the training process, which requires access to the data files. For example, if we consider training 2 model types (logistic regression and decision tree), this gives a total of 6 trained models, since we have 3 datasets.

Is there a better (more compact) way to organize my project?

EDIT

The data folder contains data-preprocessing code; this is an example:

import numpy as np

def load_student_perf():
    # Load the preprocessed arrays from disk
    X = np.load("data/raw/student_perf_X.npy")
    y = np.load("data/raw/student_perf_y.npy")
    return X, y

X, y = load_student_perf()

Solution

  • Your project structure looks fine to me. However, I would personally organize the model scripts into separate files under a common directory. Here's how I would structure this project:

    my_project
    ├── data
    │   ├── data1.csv
    │   ├── data2.csv
    │   └── data3.csv
    ├── models
    │   ├── __init__.py
    │   ├── logistic_regression.py
    │   └── decision_tree.py
    ├── utils
    │   ├── __init__.py
    │   └── data_loader.py
    └── estimator.py
    

    The data_loader module abstracts loading the different datasets from the data folder, and each model now lives in its own script under the models directory.

    The estimator script would then simply look like this:

    # Note: this import works if models/__init__.py re-exports the classes,
    # e.g. `from models.logistic_regression import LogisticRegressionModel`
    from models import LogisticRegressionModel, DecisionTreeModel
    from utils.data_loader import load_data
    
    # Load data
    data1 = load_data("data/data1.csv")
    data2 = load_data("data/data2.csv")
    data3 = load_data("data/data3.csv")
    
    # Train models
    logistic_model_1 = LogisticRegressionModel()
    logistic_model_1.train(data1.drop("target", axis=1), data1["target"])
    
    logistic_model_2 = LogisticRegressionModel()
    logistic_model_2.train(data2.drop("target", axis=1), data2["target"])
    
    # ... and so on for other models and datasets
    
    # Rest of your code here
    # ...
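    Since you train every model type on every dataset, the "... and so on" part can be collapsed into a single loop over dataset/model pairs. Here is a self-contained sketch of that pattern; `MajorityClassModel` is a hypothetical stand-in for your real model classes, and the inline DataFrames stand in for `load_data` calls:

```python
from itertools import product

import pandas as pd

class MajorityClassModel:
    """Hypothetical stand-in for a real model class (same train/predict
    interface); it just predicts the most frequent target value seen."""
    def train(self, X, y):
        self.label = y.mode()[0]

    def predict(self, X):
        return [self.label] * len(X)

# In the real project these would be loaded via load_data("data/data1.csv") etc.
datasets = {
    "data1": pd.DataFrame({"x": [1, 2, 3], "target": [0, 0, 1]}),
    "data2": pd.DataFrame({"x": [5, 6, 7], "target": [1, 1, 0]}),
}
# ... and LogisticRegressionModel, DecisionTreeModel would go here.
model_classes = [MajorityClassModel]

trained = {}
for name, ModelCls in product(datasets, model_classes):
    df = datasets[name]
    model = ModelCls()
    model.train(df.drop("target", axis=1), df["target"])
    trained[(name, ModelCls.__name__)] = model
```

    Keeping the trained models in a dict keyed by (dataset, model) also makes the later property-estimation step a simple loop over `trained.items()`.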