I trained and saved the following model using joblib:
def to_dense(x):
    return np.asarray(x.todense())

to_dense_array = FunctionTransformer(to_dense, accept_sparse=True)

model = make_pipeline(
    TfidfVectorizer(),
    to_dense_array,
    HistGradientBoostingClassifier()
)

est = model.fit(texts, y)

save_path = os.path.join(os.getcwd(), "VAT_estimator.pkl")
joblib.dump(est, save_path)
The model works fine, accuracy is good, and joblib.dump completes without any warnings.
Now, I try to reload the model from joblib using the following code:
import joblib
# Load the saved model
estimator_file = "VAT_estimator.pkl"
model = joblib.load(estimator_file)
I then get the following error message:
AttributeError: Can't get attribute 'to_dense' on <module '__main__'>
I can't avoid the conversion step to a dense array in the pipeline.
I tried to re-insert the conversion step into the model after loading, but at prediction time I get an error saying the FunctionTransformer is not callable.
I can't see any way out.
The issue arises because the FunctionTransformer in your pipeline uses a custom function, to_dense, defined in the __main__ scope. When you reload the model with joblib, it doesn't know how to find to_dense since it's no longer in the same scope.
To solve this, you need to ensure that the function is defined in the same module (file) when you load the model, or give joblib a way to find the custom function.
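If you don't want to copy the function definition into every script, you can instead keep it in a small importable module so the pickle records a fully qualified name. A minimal sketch, assuming a file called dense_utils.py on your Python path (the module name is just an example):

# dense_utils.py
import numpy as np

def to_dense(x):
    # Convert a scipy sparse matrix to a dense numpy array
    return np.asarray(x.todense())

Then import it in both the training script and the loading script:

from dense_utils import to_dense

As long as the pipeline was built with the imported function, the pickle stores a reference to dense_utils.to_dense rather than __main__.to_dense, so any script that can import dense_utils can load the model.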
There are several ways to solve this issue. One of them is to define the function at the top level and make sure it is available (defined or imported) whenever you save or load the model.
For example, first recreate what you did, with to_dense defined at the top level:
import os
import numpy as np
import joblib
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import HistGradientBoostingClassifier

# First: define the `to_dense` function at the top level
def to_dense(x):
    return np.asarray(x.todense())

# Create a transformer using the `to_dense` function
to_dense_array = FunctionTransformer(to_dense, accept_sparse=True)

# Second: define the model pipeline
model = make_pipeline(
    TfidfVectorizer(),
    to_dense_array,
    HistGradientBoostingClassifier()
)

# Fit on your data as before (model.fit returns the fitted pipeline)
# est = model.fit(texts, y)

# Third: save the fitted model using joblib
save_path = os.path.join(os.getcwd(), "VAT_estimator.pkl")
joblib.dump(model, save_path)
Now load the model. Make sure to_dense is defined or imported in the script that loads it:
import joblib
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Define the `to_dense` function again, or import it from your module
def to_dense(x):
    return np.asarray(x.todense())

# Load the saved model
estimator_file = "VAT_estimator.pkl"
model = joblib.load(estimator_file)

# Now you can use `model` to make predictions or continue training it
# Example: model.predict(new_texts)
By defining to_dense at the top level of your script and ensuring it's present when you load the model, joblib will correctly locate the function, and the pipeline should work without any issues. This complete workflow should save and load your model without encountering the AttributeError.
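As a final sanity check, you can verify that the reloaded pipeline behaves like the original one. The snippet below is only an illustration and assumes the fitted estimator est and some texts new_texts are still available in your session:

# Compare predictions of the original and the reloaded model (illustrative only)
reloaded = joblib.load("VAT_estimator.pkl")
assert (est.predict(new_texts) == reloaded.predict(new_texts)).all()
print("Reloaded model reproduces the original predictions.")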