xgboost pmml

Error writing XGBoost classifier to PMML with sklearn2pmml


I want to save my XGBoost model as PMML using sklearn2pmml. I'm using Python 3.7.3 with scikit-learn 0.20.3 and sklearn2pmml 0.53.0. My data is mainly binary, with just 3 columns of continuous data. I'm running my notebook in Databricks and converting my Spark DataFrame to a pandas DataFrame. Code snippet below:

import xgboost as xgb

from sklearn_pandas import DataFrameMapper
from sklearn.compose import ColumnTransformer

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import StandardScaler

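# pdf is the pandas DataFrame converted from Spark;
# continuous_features and numericCols are lists of column names defined earlier
X = pdf[continuous_features + numericCols]
y = pdf["Label"]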


mapper = DataFrameMapper(
  [([cont_column], [ContinuousDomain(), StandardScaler()]) for cont_column in continuous_features] +
  [([c for c in numericCols], None)] # no transformation
)

clf = xgb.XGBClassifier(objective='multi:softprob', eval_metric='auc', num_class=2,
                        n_jobs=6, max_delta_step=1, min_child_weight=14, gamma=1.5, subsample=0.8,
                        colsample_bytree=0.5, max_depth=10, learning_rate=0.1)


pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("estimator", clf)
])

pipeline.fit(X,y.values.reshape(-1,))

sklearn2pmml(pipeline, "xgb_V1.pmml", with_repr = True)

The pipeline fits the data and produces a score and predictions with pipeline.score(X, y) and pipeline.predict(X), but when I try to write it to PMML, I get the following error:

Standard output is empty
Standard error:
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 47 ms.
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Converting..
Feb 21, 2020 1:53:30 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
	at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
	at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
	at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
	at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	... 7 more

Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
	at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
	at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
	at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
	at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)

I thought it might be a version incompatibility between scikit-learn and sklearn2pmml, as per this post: https://github.com/jpmml/sklearn2pmml/issues/197, but I think the versions I have installed should be OK. Any ideas on what's going on here? Thanks in advance.


Solution

  • It is probably an XGBoost package version issue. The SkLearn2PMML package expects the label encoder (the XGBClassifier._le attribute) to be a "normal" Scikit-Learn label encoder class (sklearn.preprocessing.(label|_label).LabelEncoder), but in your case it is something different (xgboost.compat.XGBoostLabelEncoder).

    In which XGBoost package version was xgboost.compat.XGBoostLabelEncoder introduced? It is either very old or very new.

    In any case, please open a feature request with the JPMML-SkLearn project to have this issue sorted out.
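
To confirm the diagnosis above before filing a request, it may help to check which class the fitted classifier's label encoder actually is, and which XGBoost version is installed. The following is a minimal sketch, not part of sklearn2pmml or XGBoost: the helper names describe_label_encoder and installed_version are hypothetical, and it assumes the fitted classifier exposes the private _le attribute named in the error message (other XGBoost versions may not have it).

```python
import importlib


def describe_label_encoder(classifier):
    """Return the fully qualified class name of the classifier's `_le`
    attribute, or None if the attribute is missing.

    `_le` is a private attribute taken from the error message; it is an
    assumption that the installed XGBoost version sets it after fit().
    """
    encoder = getattr(classifier, "_le", None)
    if encoder is None:
        return None
    cls = type(encoder)
    return f"{cls.__module__}.{cls.__qualname__}"


def installed_version(package_name):
    """Return the package's __version__ string, or None if the package
    cannot be imported or does not define __version__."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)
```

With the pipeline from the question already fitted, describe_label_encoder(pipeline.named_steps["estimator"]) should report the encoder class that the converter rejects, and installed_version("xgboost") gives the version to quote in the feature request.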