I want to use multiple PMML files to keep the data transformation separate from the application of the model, because I want to include some kind of winsorizing of my data. Here is the code I am using:
import numpy as np
from datetime import datetime
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

train_stats = {}
continuous_domains = []
for cont in con_vars:
    # Remove -1 values so they do not distort the quantiles
    cont_val = np.asarray(train_data_sub[train_data_sub[cont] != -1][cont])
    cont_val = cont_val[~np.isnan(cont_val)]
    _95 = np.percentile(cont_val, 95)
    _05 = np.percentile(cont_val, 5)
    _50 = np.percentile(cont_val, 50)
    train_stats[cont] = [_05, _50, _95]
    continuous_domains.append(
        ([cont], [
            ContinuousDomain(
                missing_values=[-1],
                missing_value_treatment="as_value",
                missing_value_replacement=_50,
                outlier_treatment="as_extreme_values",
                low_value=_05,
                high_value=_95,
                dtype=float
            )
        ]))

data_mapper = DataFrameMapper(continuous_domains, df_out=True)
data_mapper.fit(train_data_sub)
data_mapper.transform(train_data_sub)

pmml_pipeline = PMMLPipeline(steps=[
    ('DataframeMapper', data_mapper)])

path_name = f"Trafo_{datetime.now().strftime('%Y_%m_%d_%H-%M')}.pmml"
sklearn2pmml(pmml_pipeline, path_name, debug=True)
It kind of works, but the returned .pmml file only includes the data dictionary, without the specifications of the set high/low values etc.
Interestingly, when I put a LogisticRegression inside the pipeline I get a correct-looking PMML, but there are only two outputs, and I actually want all the transformed values from the DataFrameMapper.
Can anyone help me here? I am really struggling to find a solution.
Thank you so much and best regards
Paul
the returned .pmml file only includes the data dictionary without the specifications of the set high/low values etc.
The domain specification gets stored in two places inside the PMML document.
First, the decision whether some input value is valid or invalid is encoded into the /PMML/DataDictionary/DataField
element. For example, specifying that "Sepal.Length" only accepts values within the [4.3, 7.9]
range:
<DataField name="Sepal_Length" displayName="Sepal length in cm" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="4.3" rightMargin="7.9"/>
</DataField>
Second, the decision of how the pre-categorized input value (i.e. valid/invalid/missing) should be handled by any given model is encoded into the /PMML/<Model>/MiningSchema/MiningField
elements. For example, specifying that invalid and missing values are not permitted:
<MiningSchema>
<MiningField name="Sepal_Length" invalidValueTreatment="returnInvalid" missingValueTreatment="returnInvalid"/>
</MiningSchema>
Your pipeline does not contain a model object, so there is physically no place where the second part of the domain specification could be stored.
Interestingly, when I put a LogisticRegression inside the pipeline I get a correct-looking PMML, but there are only two outputs...
The SkLearn2PMML package performs PMML optimization, where unused field declarations are automatically removed in order to reduce document size and cognitive load.
It looks like your LogisticRegression
model object only uses two input fields.
You could try replacing LogisticRegression
with some other model type that uses all input fields. Some "greedy" algorithm such as a random forest might be a good fit.
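To illustrate why a tree ensemble tends to keep all input fields alive, here is a toy sketch with made-up data and shapes; in the real setup the classifier would sit inside the PMMLPipeline after the DataFrameMapper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real training frame: four continuous fields,
# of which only the first two actually drive the label.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A "greedy" ensemble splits on many features, so every input field tends
# to stay referenced by the model (and is therefore kept in the PMML).
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

# Every field with a non-zero importance is referenced somewhere in the forest.
used_fields = int((clf.feature_importances_ > 0).sum())
print(f"{used_fields} of 4 input fields are referenced by the forest")
```

A linear model with few non-zero coefficients keeps only the fields it uses, which is why the LogisticRegression variant produced a PMML with just two inputs.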
TL;DR: You cannot separate domain decorators such as ContinuousDomain
from the actual model object. If you want to reuse pre-fitted domain decorators in multiple pipelines, you should store them as pure Python objects in the Pickle or Dill data formats. Alternatively, write a utility function (and package it as a Python library) that you can import and call conveniently.
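A minimal sketch of the pickle route, using a plain dict of training statistics as a stand-in (any fitted Python object, including the DataFrameMapper from the question, pickles the same way):

```python
import pickle

# Stand-in for the fitted transformation state; in practice you would
# dump the fitted DataFrameMapper (or the ContinuousDomain objects).
train_stats = {"var_a": [0.5, 2.0, 9.5], "var_b": [1.0, 3.0, 7.0]}

with open("train_stats.pkl", "wb") as f:
    pickle.dump(train_stats, f)

# Later, in another pipeline-building script:
with open("train_stats.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == train_stats
```

Each new pipeline then loads the pickled decorators, places them in its own DataFrameMapper step ahead of the model object, and exports one complete PMML per pipeline.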