python scikit-learn sklearn2pmml

sklearn2pmml omits field names


I export an instance of sklearn.preprocessing.StandardScaler to a PMML file. The problem is that the field names do not appear in the PMML file. For example, with the iris dataset the original field names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] are missing; only names like x1, x2, etc. appear. Is there a way to get the original field names into the PMML file? The following code should be runnable:

from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline  
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)

ssModel = StandardScaler()
ssModel.fit(dfIris)


pipe = PMMLPipeline([("StandardScaler", ssModel)])
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")

In ssIris.pmml I see only the generic field names (x1, x2, etc.) instead of the column names. (The original post showed a screenshot of the file here.)


Solution

  • First, I believe you want to fit the PMMLPipeline after creating it, so call pipe.fit(dfIris) rather than fitting ssModel on its own beforehand. To preserve the column names, put a no-op preprocessing step in front of the scaler: a DataFrameMapper (from sklearn_pandas) that maps each DataFrame column to None, i.e. passes it through untransformed. The pipeline seems to need such a preprocessing step in order to keep the column names. I am not sure whether this is the best way, but I checked it and it preserved the column names.

    from sklearn_pandas import DataFrameMapper
    from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
    from sklearn.datasets import load_iris
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    import pandas as pd
    data = load_iris()
    dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
    
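    ssModel = StandardScaler()
    
    # Map every column to None (pass-through) so the column names are kept;
    # df_out=True makes the mapper return a DataFrame.
    pipe = PMMLPipeline([
        ("df_mapper", DataFrameMapper([(d, None) for d in data.feature_names], df_out=True)),
        ("StandardScaler", ssModel),
    ])
    # Fit the whole pipeline instead of fitting the scaler separately.
    pipe.fit(dfIris)
    sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")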
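
    To check which field names ended up in the exported file without opening it by hand, something like the following sketch may help. It uses only the standard library and lists the name attribute of every DataField element in the PMML data dictionary. The helper name pmml_field_names and the SAMPLE_PMML fragment are made up for illustration; the fragment is hand-written, not real sklearn2pmml output.

```python
import xml.etree.ElementTree as ET


def pmml_field_names(pmml_text):
    """Return the `name` attribute of every DataField element, in document order.

    Hypothetical helper for illustration; accepts the PMML document as str or bytes.
    """
    root = ET.fromstring(pmml_text)
    names = []
    for elem in root.iter():
        # Drop the "{namespace}" prefix so any PMML schema version matches.
        local_tag = elem.tag.rsplit("}", 1)[-1]
        if local_tag == "DataField":
            names.append(elem.get("name"))
    return names


# Hand-written sample fragment (NOT real sklearn2pmml output), for illustration only.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="sepal length (cm)" optype="continuous" dataType="double"/>
    <DataField name="sepal width (cm)" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

print(pmml_field_names(SAMPLE_PMML))
```

    For the real export, read the file as bytes and pass it in, e.g. pmml_field_names(open("ssIris.pmml", "rb").read()). If the result still shows x1, x2, ..., the column names were not carried through.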