How to Get feature_importance when using sklearn2pmml

Now i trained a gbdt model named 'GB' in python sklearn. And i want to export this trained model into pmml files. But i meet this problem： 1. if i try to put the trained 'GB' model into PMMLpipeline and use sklearn2pmml to export the model. like below:

GB = GradientBoostingClassifier(n_estimators=100,learning_rate=0.05)
GB.fit(train[list(x_features),Train['Target']])
GB_pipeline = PMMLPipeline([("classifier",GB)])
sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')
importance=gb.feature_importances_

there is a warning 'The 'active_fields' attribute is not set'. and i will lose all the features' names in the exported pmml file.

and if i try to train the model directly in the PMMLPipeline. Since there is no feature_importances_ attribute in the GB_pipeline i cannot observe the features_importance of this model. Like below:

GB_pipeline = PMMLPipeline([("classifier",GradientBoostingClassifier(n_estimators=100,learning_rate=0.05))]) PMMLPipeline.fit(train[list(x_features),Train['Target']]) sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')

what shall i do that i can both observe the features_importance of the model and also keep the features' names in the exported pmml file. Thank you very much!

Solution

Important points:

Instantiate the classifier outside of pipeline
Instantiate the (PMML-) pipeline, insert this classifier into it.
Fit this pipeline as a whole.
Print the feature importances of this classifier, and export this pipeline into a PMML document.

In your first code example, you're fitting the classifier, but you should be fitting the pipeline as a whole - hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don't have a direct reference to the classifier (however, you could obtain it by "parsing" the last step of the fitted pipeline).

A complete example based on the Iris dataset:

import pandas
iris_df = pandas.read_csv("Iris.csv")

from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline
gbt = GradientBoostingClassifier()
pipeline = PMMLPipeline([
    ("classifier", gbt)
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
print (gbt.feature_importances_)
sklearn2pmml(pipeline, "GBTIris.pmml", with_repr = True)