I'm training a regression model and inside my pipeline I have something like this:
best_pipeline = Pipeline(
    steps=[
        (
            "features",
            ColumnTransformer(
                transformers=[
                    (
                        "area",
                        make_pipeline(
                            impute.SimpleImputer(),
                            pr.FunctionTransformer(lambda x: np.log1p(x)),
                            StandardScaler(),
                        ),
                        ["area"],
                    )
                ]
            ),
        ),
        (
            "regressor",
            TransformedTargetRegressor(
                regressor=model,
                transformer=PowerTransformer(method='box-cox')
            ),
        ),
    ]
)
There are obviously more features, but the full code would be too long to post. I train the model, and if I predict in the same script everything works. I then store the model using dill and try to use it in another Python file.
In this other file I load the model and try this:
import numpy as np
df['prediction'] = self.model.predict(df)
Internally, when the pipeline tries to apply the transform, it raises:
NameError: name 'np' is not defined
You can use third-party library functions by simply passing the function itself as the func argument:
import numpy
transformer = FunctionTransformer(numpy.log1p)
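There is no need for lambdas or custom wrapper classes. The lambda fails because its body looks up the global name np at call time, and that name is not available in the namespace the function is restored into; passing the function directly avoids that lookup. The above solution can also be persisted with the plain pickle module.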
When porting objects between different environments, it's probably a good idea to use canonical module names, hence numpy.log1p instead of np.log1p.
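As a rough check of the pickle claim, here is a minimal sketch that round-trips a small pipeline through plain pickle and compares the outputs. The toy array X and the two-step pipeline are made up for illustration and stand in for the real "area" column and the full pipeline from the question.

```python
import pickle

import numpy
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data standing in for the "area" column (hypothetical values).
X = numpy.array([[0.0], [1.0], [3.0], [7.0]])

# Pass the function object itself; no lambda is involved.
pipeline = make_pipeline(FunctionTransformer(numpy.log1p), StandardScaler())
pipeline.fit(X)

# Serialize and restore with plain pickle (dill is not needed here).
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

# The restored pipeline should reproduce the original transform.
expected = pipeline.transform(X)
actual = restored.transform(X)
print(numpy.allclose(expected, actual))
```

The same pattern should carry over to the ColumnTransformer in the question by swapping the lambda for numpy.log1p in the FunctionTransformer step.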