python scikit-learn pipeline transformation

Evaluate transformations with the same model in scikit-learn


I would like to perform a regression analysis and test different transformations of the input variables for the same model. To accomplish this, I created a dictionary with the different pipelines, which I loop through:

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import TransformedTargetRegressor

# Define transformations and models
models = {
    'linear': LinearRegression(),
    'power': make_pipeline(PowerTransformer(), LinearRegression()),
    'log': make_pipeline(FunctionTransformer(np.log, np.exp),
                         LinearRegression()),
    'log-sqrt': TransformedTargetRegressor(
        regressor=make_pipeline(
            FunctionTransformer(np.log, np.exp),
            LinearRegression()),
        func=np.sqrt,
        inverse_func=np.square
        )
    }

# x_train, y_train and x_hat are assumed to be defined beforehand
parameters = pd.DataFrame()
for name, model in models.items():
    model.fit(x_train, y_train)
    y_hat = model.predict(x_hat)
    y_hat_train = model.predict(x_train)
    r2 = model.score(x_train, y_train)  # R2 on the training data
    parameters.at[name, 'MSE'] = mean_squared_error(y_train, y_hat_train)
    parameters.at[name, 'R2'] = r2
# Name of the model with the highest training R2
best_model = parameters['R2'].idxmax()

This works. However, there is probably a more elegant solution similar to GridSearchCV for evaluating models. Can anyone give me some advice on what I should be looking for?


Solution

  • I actually found a solution similar to Ben Reiniger's, but using GridSearchCV. Transforming the target variable did not work smoothly at first, but Ben's solution helped, and now it works for this transformation too.

    The disadvantage compared to my initial approach with a dictionary and a loop is that, as Ben noted, more models are fitted (one per parameter combination and cross-validation fold) and the individual fitted models are not accessible afterwards.

    Here's my solution using GridSearchCV:

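    # Reuses the imports from the question (numpy, FunctionTransformer, etc.)
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    
    # The 'transformer' step is a placeholder; the grid below fills it in
    pipeline = Pipeline([('transformer', None), ('estimator', LinearRegression())])
    # Each key names a pipeline step, each list holds the candidates for it
    parameters = {
        'transformer': [
            FunctionTransformer(np.log, np.exp),
            PowerTransformer()
            ],
        'estimator': [
            TransformedTargetRegressor(
                regressor=LinearRegression(),
                func=np.sqrt,
                inverse_func=np.square
                )
            ]
        }
    # X and Y are the full feature matrix and target, assumed to be defined
    grid_search = GridSearchCV(pipeline, parameters, cv=5)
    grid_search.fit(X, Y)
    best_model = grid_search.best_estimator_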
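
To compare the candidates rather than only retrieve the best one, the cross-validated scores in `cv_results_` can be put into a DataFrame, much like the `parameters` table in the question. The following is a minimal sketch, not a definitive implementation. It assumes synthetic, strictly positive data (so that `np.log` and `np.sqrt` are defined), and it adds `'passthrough'` and a plain `LinearRegression()` to the grid so that the untransformed variants from the question are covered as well. Those additions, the `param_grid` and `summary` names, and the `scoring='r2'` choice are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# Synthetic, noise-free, strictly positive data (assumption for illustration):
# sqrt(Y) is exactly linear in log(X).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 2))
Y = (1.0 + 2.0 * np.log(X[:, 0]) + 0.5 * np.log(X[:, 1])) ** 2

pipeline = Pipeline([('transformer', 'passthrough'),
                     ('estimator', LinearRegression())])

# 3 input transformations x 2 target treatments = 6 candidates
param_grid = {
    'transformer': [
        'passthrough',                        # no input transformation
        FunctionTransformer(np.log, np.exp),
        PowerTransformer(),
    ],
    'estimator': [
        LinearRegression(),                   # untransformed target
        TransformedTargetRegressor(
            regressor=LinearRegression(),
            func=np.sqrt,
            inverse_func=np.square,
        ),
    ],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid_search.fit(X, Y)

# One row per candidate, with its mean cross-validated R2 and rank
results = pd.DataFrame(grid_search.cv_results_)
summary = results[['param_transformer', 'param_estimator',
                   'mean_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))

# Only the best candidate is refit on all data and kept as a fitted model
best_model = grid_search.best_estimator_
```

Unlike the loop in the question, these scores come from held-out folds rather than from the training data, so the ranking is less prone to favouring an overfitted pipeline. The per-candidate fitted models are still not retained; only `best_estimator_` is.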