python pmml

How to get feature_importances_ when using sklearn2pmml


I trained a GBDT model named 'GB' with scikit-learn in Python, and I want to export the trained model to a PMML file. But I ran into the following problem:

  1. If I put the already trained 'GB' model into a PMMLPipeline and use sklearn2pmml to export it, like below:

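    from sklearn.ensemble import GradientBoostingClassifier
    import sklearn2pmml
    from sklearn2pmml import PMMLPipeline

    GB = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05)
    # The classifier is fitted on its own, before it is wrapped in the pipeline
    GB.fit(train[list(x_features)], train['Target'])
    GB_pipeline = PMMLPipeline([("classifier", GB)])
    sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')
    importance = GB.feature_importances_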

I get the warning 'The 'active_fields' attribute is not set', and all the feature names are lost in the exported PMML file.

  2. If I instead train the model directly inside the PMMLPipeline, GB_pipeline has no feature_importances_ attribute, so I cannot inspect the feature importances of the model. Like below:

    GB_pipeline = PMMLPipeline([("classifier", GradientBoostingClassifier(n_estimators=100, learning_rate=0.05))])
    GB_pipeline.fit(train[list(x_features)], train['Target'])
    sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')

What should I do so that I can both inspect the feature importances of the model and keep the feature names in the exported PMML file? Thank you very much!


Solution

  • Important points:

    1. Instantiate the classifier outside of the pipeline.
    2. Instantiate the (PMML-) pipeline, insert this classifier into it.
    3. Fit this pipeline as a whole.
    4. Print the feature importances of this classifier, and export this pipeline into a PMML document.

    In your first code example, you're fitting the classifier, but you should be fitting the pipeline as a whole - hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don't have a direct reference to the classifier (however, you could obtain it by "parsing" the last step of the fitted pipeline).
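For the second case, the fitted classifier can be pulled back out of the pipeline. The sketch below is only an illustration: it uses scikit-learn's plain `Pipeline` as a stand-in (on the assumption that `PMMLPipeline`, being a subclass of it, exposes the same `named_steps` and `steps` attributes) and the bundled Iris data from `load_iris` instead of the asker's `train` frame.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline  # stand-in for PMMLPipeline (assumption)

X, y = load_iris(return_X_y=True)

# The classifier is created inline, so no outside reference to it exists.
pipeline = Pipeline([
    ("classifier", GradientBoostingClassifier(n_estimators=100, learning_rate=0.05))
])
pipeline.fit(X, y)

# Recover the fitted classifier by step name...
clf = pipeline.named_steps["classifier"]
# ...or by position: the last step is a (name, estimator) tuple.
last_step = pipeline.steps[-1][1]

importances = clf.feature_importances_
print(importances)
```

Both lookups return the same fitted object, so either can be used to read `feature_importances_` after the pipeline has been fitted.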

    A complete example based on the Iris dataset:

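    import pandas
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn2pmml import sklearn2pmml, PMMLPipeline

    iris_df = pandas.read_csv("Iris.csv")

    # Keep a direct reference to the classifier outside of the pipeline
    gbt = GradientBoostingClassifier()
    pipeline = PMMLPipeline([
        ("classifier", gbt)
    ])
    # Fit the pipeline as a whole, so that it records the feature names
    pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
    # The classifier was fitted in place, so its importances are available here
    print(gbt.feature_importances_)
    sklearn2pmml(pipeline, "GBTIris.pmml", with_repr=True)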
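
The printed array is in the same order as the columns passed to `fit`, so it can be paired with the feature names for easier reading. A minimal sketch, again assuming scikit-learn's plain `Pipeline` as a stand-in for `PMMLPipeline` and `load_iris(as_frame=True)` in place of the `Iris.csv` file; the `ranked` variable is a name chosen for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline  # stand-in for PMMLPipeline (assumption)

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

gbt = GradientBoostingClassifier(random_state=0)
pipeline = Pipeline([("classifier", gbt)])
pipeline.fit(X, y)

# Label each importance with its column name and sort from most to least important.
ranked = pd.Series(gbt.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```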