python, numpy, scikit-learn, numpy-ndarray, joblib

Loading a pipeline with a dense-array conversion step


I trained and saved the following model using joblib:

def to_dense(x):
    return np.asarray(x.todense())

to_dense_array = FunctionTransformer(to_dense, accept_sparse=True)

model = make_pipeline(
    TfidfVectorizer(),
    to_dense_array,
    HistGradientBoostingClassifier()
)

est = model.fit(texts, y)
save_path = os.path.join(os.getcwd(), "VAT_estimator.pkl")
joblib.dump(est, save_path)

The model works fine, accuracy is good, and no warning is issued when saving with joblib.

Now, I try to reload the model from joblib using the following code:

import joblib
# Load the saved model
estimator_file = "VAT_estimator.pkl"
model = joblib.load(estimator_file)

I then get the following error message:

AttributeError: Can't get attribute 'to_dense' on <module '__main__'>

I can't avoid the conversion step to a dense array in the pipeline.

I tried re-inserting the conversion step into the model after loading it, but at prediction time I get a message saying that FunctionTransformer is not callable.

I can't see any way out.


Solution

  • The issue arises because the FunctionTransformer in your pipeline wraps a custom function, to_dense, that was defined in the __main__ scope. When you reload the model, joblib (via pickle) looks up to_dense by its module path, and it can't find it because the loading script's __main__ doesn't define it.

    To solve this, you need to ensure that the function is defined (or importable) in the script that loads the model, so that joblib can resolve the reference to the custom function.
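    You can see this reference-only behavior directly. The minimal sketch below (using plain pickle, which joblib builds on; the function name to_dense_demo is just an illustrative stand-in) shows that the serialized payload stores the function's module and name, not its code:

```python
import pickle

# A top-level function, standing in for the pipeline's to_dense
def to_dense_demo(x):
    return x

payload = pickle.dumps(to_dense_demo)

# The payload records a reference (module + qualified name), not bytecode,
# so the same name must be resolvable in the process that loads it.
print(b"to_dense_demo" in payload)  # the name is stored literally
```

    This is why loading fails in a fresh script: unpickling tries to re-import "__main__.to_dense" and finds nothing there.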

    There are several ways to solve this. One of them is to define the function at the top level of your script (outside of any other function or class), and make sure it is present both when saving and when loading the model.

    For example, first recreate what you did, with to_dense defined at module level:

    import os
    import numpy as np
    import joblib
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import HistGradientBoostingClassifier
    
    # first: Define the `to_dense` function at the top level
    def to_dense(x):
        return np.asarray(x.todense())
    
    # Create a transformer using the `to_dense` function
    to_dense_array = FunctionTransformer(to_dense, accept_sparse=True)
    
    # second : Define the model pipeline
    model = make_pipeline(
        TfidfVectorizer(),
        to_dense_array,
        HistGradientBoostingClassifier()
    )
    
    # fit the pipeline on your data
    # est = model.fit(texts, y)
    
    # third: Save the model using joblib
    save_path = os.path.join(os.getcwd(), "VAT_estimator.pkl")
    joblib.dump(model, save_path)
    
    

    Now load the model. Make sure to_dense is defined or imported in the loading script before calling joblib.load:

    import joblib
    import numpy as np
    from sklearn.preprocessing import FunctionTransformer
    
    # define the `to_dense` function again or import it from your module
    def to_dense(x):
        return np.asarray(x.todense())
    
    # load the saved model
    estimator_file = "VAT_estimator.pkl"
    model = joblib.load(estimator_file)
    
    # Now you can use `model` to make predictions or train it further
    # Example: model.predict(new_texts)
    
    

    By defining to_dense at the top level of your script and ensuring it is present when you load the model, joblib can correctly locate the function, and the pipeline works without issues.

    This complete workflow should save and load your model without encountering the AttributeError.
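
    Another option, useful when several different scripts need to load the model, is to keep to_dense in its own small module and import it everywhere. The module name my_transforms below is a hypothetical choice; the sketch writes the file at runtime only so the example is self-contained (in practice it would be a normal .py file committed alongside your training and inference code):

```python
import pathlib
import pickle
import sys
import tempfile
import textwrap

# Hypothetical helper module holding the transform function
module_dir = tempfile.mkdtemp()
pathlib.Path(module_dir, "my_transforms.py").write_text(textwrap.dedent("""\
    import numpy as np

    def to_dense(x):
        return np.asarray(x.todense())
"""))
sys.path.insert(0, module_dir)

from my_transforms import to_dense

# The pickle now records "my_transforms.to_dense", a stable module path
# that any script importing my_transforms can resolve -- no dependency
# on what happens to be defined in __main__.
payload = pickle.dumps(to_dense)
print(b"my_transforms" in payload)
```

    With this layout, both the training script (which calls joblib.dump) and every loading script only need `from my_transforms import to_dense` before touching the pickle file.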