Tags: python, scikit-learn

What is the rationale behind `TransformedTargetRegressor` always cloning the given `regressor` and how to prevent this behavior?


The docs of sklearn.compose.TransformedTargetRegressor state that:

regressor object, default=None

Regressor object such as derived from RegressorMixin. This regressor will automatically be cloned each time prior to fitting. If regressor is None, LinearRegression is created and used.
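
For context, "cloned" here means sklearn.base.clone: a new, unfitted estimator with the same hyperparameters is constructed, and the instance you passed in is left untouched. A minimal sketch of what clone does:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 1.0]))
reg_clone = clone(reg)

print(reg is reg_clone)             # False: a brand-new object
print(hasattr(reg, 'coef_'))        # True: the original stays fitted
print(hasattr(reg_clone, 'coef_'))  # False: the clone is unfitted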

What is the rationale behind cloning the given regressor each time prior to fitting? Why would this be useful?

This behavior prevents, for example, the following code from working:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor


X = np.random.default_rng(seed=1).normal(size=(100, 3))
y = np.random.default_rng(seed=1).normal(size=100)

model = RandomForestRegressor()
pipeline = Pipeline(
    steps=[
        ('normalize', StandardScaler()),
        ('model', model),
    ],
)
tt = TransformedTargetRegressor(regressor=pipeline, transformer=StandardScaler())
tt.fit(X, y)

print(model.feature_importances_)

It results in:

Traceback (most recent call last):
  File "/tmp/test.py", line 21, in <module>
    print(model.feature_importances_)
[...]
sklearn.exceptions.NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

which is not surprising: TransformedTargetRegressor fits a clone of the pipeline, so the original model object never gets fitted.

So, is there a way to prevent this cloning behavior and make the above code work?


Solution

  • All sklearn meta-estimators (except Pipeline) clone their base estimators; I can't say definitively why the developers chose that paradigm, but a plausible rationale is that cloning guarantees every call to fit starts from a fresh, unfitted estimator and never mutates the instance you passed in, which matters when the same object is reused, e.g. across cross-validation folds or grid-search candidates.

    There is no option to switch the cloning off, but the fitted base estimators are always made available through trailing-underscore attributes (here regressor_): instead of

    model.feature_importances_
    

    use

    tt.regressor_['model'].feature_importances_
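
    Putting this together with the snippet from the question (continuing right after tt.fit(X, y); the step name 'model' comes from the pipeline definition above), a minimal sketch:

    # tt.regressor_ is the fitted clone of the whole pipeline passed as `regressor`
    fitted_pipeline = tt.regressor_

    # Pipelines can be indexed by step name (equivalently: .named_steps['model'])
    fitted_model = fitted_pipeline['model']
    print(fitted_model.feature_importances_)  # works: this instance was fitted

    The original model object remains unfitted; only the clone stored inside tt was trained.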