I am exporting a PMMLPipeline with a categorical string feature, day_of_week, as a PMML file. When I open the file in Java and list the InputFields, I see that the data type of the day_of_week field is double:
InputField{name=day_of_week, fieldName=day_of_week, displayName=null, dataType=double, opType=categorical}
Hence when I evaluate an input I get the error:
org.jpmml.evaluator.InvalidResultException: Field "day_of_week" cannot accept user input value "tuesday"
On the Python side the pipeline works with a string column:
data = pd.DataFrame(data=[{"age": 10, "day_of_week": "tuesday"}])
y = trained_model.predict(X=data)
Minimal example for creating the PMML file:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
if __name__ == '__main__':
    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }
    data = pd.DataFrame(data_dict, columns=list(data_dict))
    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])
    categorical_features = ['day_of_week']
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])
    pipeline = PMMLPipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])
    X = data.drop(labels=['y'], axis=1)
    y = data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
    trained_model = pipeline.fit(X=X_train, y=y_train)
    sklearn2pmml(pipeline=pipeline, pmml='RandomForestRegressor2.pmml', with_repr=True)
EDIT:
sklearn2pmml creates a PMML file with a DataDictionary in which the DataField "day_of_week" has dataType="double". I think it should be "string". Do I have to set the dataType somewhere to correct this?
<DataDictionary>
<DataField name="day_of_week" optype="categorical" dataType="double">
You can assist SkLearn2PMML by providing "feature type hints" using the sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain decorators (see the SkLearn2PMML documentation for more details).
In the current case, you should prepend a CategoricalDomain step to the pipeline that deals with categorical features:
from sklearn2pmml.decoration import CategoricalDomain
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype=str)),
    ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))
])
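
After re-exporting, one way to confirm that the hint took effect without going through Java is to read the DataDictionary straight from the PMML file. The following is a minimal sketch using only the standard library. The helper name data_field_types and the sample document are made up for illustration (the sample is hand-written, not real sklearn2pmml output), and the namespace prefix is stripped on the assumption that its version may differ between exports.

```python
import xml.etree.ElementTree as ET


def data_field_types(pmml_text):
    """Map each DataField name to its declared dataType."""
    root = ET.fromstring(pmml_text)
    types = {}
    for elem in root.iter():
        # Drop the "{namespace}" prefix so the check does not depend
        # on which PMML schema version the file declares.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'DataField':
            types[elem.get('name')] = elem.get('dataType')
    return types


# Hand-written sample for illustration only (hypothetical content).
sample_pmml = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="day_of_week" optype="categorical" dataType="string"/>
    <DataField name="age" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

print(data_field_types(sample_pmml))
```

For the real export, pass the contents of RandomForestRegressor2.pmml instead of the sample string and check that day_of_week is reported as string rather than double.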