python scikit-learn pipeline pmml sklearn2pmml

sklearn to PMML pipeline: how to apply a postprocessing linear transformation


I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package. What I'm trying to do is apply a linear transformation to the output of the predict_proba method within the PMMLPipeline class of the sklearn2pmml package. Any idea how to do this? Even a solution outside this package would help, as long as it can be automated (for example, automatically modifying the PMML's XML).

Here's an example so you can get a deeper understanding of what I'm trying to do:

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

# Forget about the train/test split; for now we only care whether the PMML pipeline works
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
model = DecisionTreeClassifier()
model.fit(X,y)

def postprocessing_linear_transformation(probabilities, a, b):
    """Multiply the probabilities by a and add b."""
    return probabilities * a + b

# The pipeline should look like this:
# first, predict the probabilities
probabilities = model.predict_proba(X)[:, 0]
# then scale them (apply the linear transformation)
probabilities_scaled = postprocessing_linear_transformation(probabilities, a=1000, b=100)

# Of course this does not work as written
pmml_pipeline = PMMLPipeline([
    # the category preprocessor would go here; I know this does not work, but it shows the idea
    ('decisiontree', model),
    ('postprocessing_apply_linear_transformation', postprocessing_linear_transformation)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)

Solution

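  • On second thought, you don't need a full-blown LinearRegression step to perform a deterministic a * x + b probability scaling operation. In the expression below, X[0] refers to the first column of the predict_proba output. A simple ExpressionTransformer step, passed as the pipeline's predict_proba_transformer, is more than adequate: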

    from sklearn2pmml.preprocessing import ExpressionTransformer
    
    pipeline = PMMLPipeline([
      ("decisiontree", model)
    ], predict_proba_transformer = ExpressionTransformer("X[0] * 1000 + 100"))
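
The question also asks whether the PMML's XML could be modified automatically instead. Here is a minimal sketch of that route using only the standard library. It is not something sklearn2pmml provides: the helper name add_scaled_probability, the output field name scaled_probability, the PMML 4.4 namespace, and the tiny sample document are all assumptions made for illustration. Check the namespace and the probability field name against your exported file before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumption: the exported file uses the PMML 4.4 namespace.
PMML_NS = "http://www.dmg.org/PMML-4_4"


def add_scaled_probability(pmml_text, source_field="probability(0)",
                           target_field="scaled_probability", a=1000.0, b=100.0):
    """Append an OutputField computing a * source_field + b to the first <Output>."""
    ET.register_namespace("", PMML_NS)  # keep the default namespace on output
    root = ET.fromstring(pmml_text)
    output = root.find(f".//{{{PMML_NS}}}Output")
    if output is None:
        raise ValueError("no <Output> element found")

    def tag(name):
        return f"{{{PMML_NS}}}{name}"

    # <OutputField feature="transformedValue"> holding (source * a) + b
    field = ET.SubElement(output, tag("OutputField"), {
        "name": target_field,
        "optype": "continuous",
        "dataType": "double",
        "feature": "transformedValue",
    })
    add = ET.SubElement(field, tag("Apply"), {"function": "+"})
    mul = ET.SubElement(add, tag("Apply"), {"function": "*"})
    ET.SubElement(mul, tag("FieldRef"), {"field": source_field})
    ET.SubElement(mul, tag("Constant"), {"dataType": "double"}).text = str(a)
    ET.SubElement(add, tag("Constant"), {"dataType": "double"}).text = str(b)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for an exported classifier PMML.
sample = f"""<PMML xmlns="{PMML_NS}" version="4.4">
  <TreeModel functionName="classification">
    <Output>
      <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
    </Output>
  </TreeModel>
</PMML>"""

patched = add_scaled_probability(sample)
print(patched)
```

For a real file you would read example_pipeline_pmml.pmml, pass its text to the helper, and write the result back. The predict_proba_transformer approach above is still the simpler option, since the converter generates the output field for you.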