python machine-learning scikit-learn imputation

Accessing the values used to impute and normalize new data based upon scikit-learn ColumnTransformer


Using scikit-learn, I'm building machine learning models on a training set and then evaluating them on a test set. On the training set I perform imputation and scaling with a ColumnTransformer, then build a logistic regression model using k-fold CV, and the final model is used to predict on the test set. The final model also uses the values fitted by the ColumnTransformer to impute missing values in the test set. For example, a min-max scaler takes the min and max values from the training set and uses those values when scaling the test set. How can I see these values, which are derived from the training set and then applied to the test set? I can't find anything about this in the scikit-learn documentation. Here is the code I'm using:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def preprocessClassifierLR(categorical_vars, numeric_vars):
    # categorical_vars and numeric_vars are lists of the column names of the
    # categorical and numeric variables present in X

    categorical_pipeline = Pipeline(steps=[('mode', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
                                           ("one_hot_encode", OneHotEncoder(handle_unknown='ignore'))])

    numeric_pipeline = Pipeline(steps=[('numeric', SimpleImputer(strategy="median")),
                                       ("scaling", MinMaxScaler())])

    col_transform = ColumnTransformer(transformers=[("cats", categorical_pipeline, categorical_vars),
                                                    ("nums", numeric_pipeline, numeric_vars)])

    lr = SGDClassifier(loss='log_loss', penalty='elasticnet')
    model_pipeline = Pipeline(steps=[('preprocess', col_transform),
                                     ('classifier', lr)])


    random_grid_lr = {'classifier__alpha': [1e-1, 0.2, 0.5],
                      'classifier__l1_ratio': [1e-3, 0.5]}

    kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=47)

    param_search = GridSearchCV(model_pipeline, random_grid_lr, scoring='roc_auc', cv=kfold, refit=True)

    # Return the unfitted search object so the caller can fit it
    return param_search

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

param_search = preprocessClassifierLR(categorical_vars, numeric_vars)
train_mod = param_search.fit(X_train, y_train)
print("Mod AUC:", train_mod.best_score_)

test_preds = train_mod.predict_proba(X_test)[:,1]

I can't provide the real data, but X is a dataframe with the independent variables and y is the binary outcome variable. train_mod is the fitted GridSearchCV, whose best_estimator_ is a pipeline containing the ColumnTransformer and SGDClassifier steps. I can easily get the tuned classifier parameters, such as the optimal alpha and l1_ratio values, by running train_mod.best_params_, but I cannot figure out how to get the statistics used by the column transformer: 1) the modes used by the SimpleImputer for the categorical features, 2) the medians used by the SimpleImputer for the numeric features, and 3) the min and max values used to scale the numeric features. How can I access this information?

I assumed that train_mod.best_estimator_['preprocess'].transformers_ would contain this information, just as train_mod.best_params_ gives me the alpha and l1_ratio values that are derived from model training and then applied to the test set.


Solution

  • Pipeline, ColumnTransformer, GridSearchCV, and similar objects all expose their component parts as attributes (and sometimes through a custom __getitem__, so you can index them like dictionaries). Each fitted transformer likewise stores its fitted statistics as attributes, so it's just a matter of chaining these together, e.g.:

    (
        train_mod  # a fitted GridSearchCV; has the next attribute
        .best_estimator_  # a Pipeline; steps are accessible by getitem
        ['preprocess']  # the ColumnTransformer
        .named_transformers_  # fitted transformers, accessed by getitem
        ['cats']  # the categorical Pipeline
        ['mode']  # the SimpleImputer
        .statistics_  # the computed modes, one per column seen by this SimpleImputer
    )