I want to use multiple PMML files to keep the data transformation separate from the application of the model, because I want to include some kind of winsorizing of my data. Here is the code I am using:
import numpy as np
from datetime import datetime
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

train_stats = {}
continuous_domains = []
for cont in con_vars:
    # Remove -1 values so they do not distort the quantiles
    cont_val = np.asarray(train_data_sub[train_data_sub[cont] != -1][cont])
    cont_val = cont_val[~np.isnan(cont_val)]
    _95 = np.percentile(cont_val, 95)
    _05 = np.percentile(cont_val, 5)
    _50 = np.percentile(cont_val, 50)
    train_stats[cont] = [_05, _50, _95]
    continuous_domains.append(
        ([cont], [
            ContinuousDomain(
                missing_values=[-1],
                missing_value_treatment="as_value",
                missing_value_replacement=_50,
                outlier_treatment="as_extreme_values",
                low_value=_05,
                high_value=_95,
                dtype=float
            )
        ]))

data_mapper = DataFrameMapper(continuous_domains, df_out=True)
data_mapper.fit(train_data_sub)
data_mapper.transform(train_data_sub)

pmml_pipeline = PMMLPipeline(steps=[
    ('DataframeMapper', data_mapper)])

path_name = f"Trafo_{datetime.now().strftime('%Y_%m_%d_%H-%M')}.pmml"
sklearn2pmml(pmml_pipeline, path_name, debug=True)
It kind of works, but the returned .pmml file only includes the data dictionary, without the specifications of the set high/low values etc.
Interestingly, when I put a LogisticRegression inside the pipeline I get a correct-looking PMML, but there are only two outputs, and I actually want all the transformed values from the DataFrameMapper.
Can anyone help me here? I am really struggling to find a solution.
Thank you so much and best regards
Paul
the returned .pmml file only includes the data dictionary without the specifications of the set high/low values etc.
The domain specification gets stored in two places inside the PMML document.
First, the decision whether some input value is valid or invalid is encoded into the /PMML/DataDictionary/DataField
element. For example, specifying that "Sepal.Length" only accepts values within the [4.3, 7.9]
range:
<DataField name="Sepal_Length" displayName="Sepal length in cm" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="4.3" rightMargin="7.9"/>
</DataField>
Second, the decision of how the pre-categorized input value (i.e. valid/invalid/missing) should be handled by any given model is encoded into the /PMML/<Model>/MiningSchema/MiningField
elements. For example, specifying that invalid and missing values are not permitted:
<MiningSchema>
<MiningField name="Sepal_Length" invalidValueTreatment="returnInvalid" missingValueTreatment="returnInvalid"/>
</MiningSchema>
Your pipeline does not contain a model object, so there is physically no place where the second part of the domain specification could be stored.
Interestingly, when I put a LogisticRegression inside the pipeline I get a correct-looking PMML, but there are only two outputs...
The SkLearn2PMML package performs PMML optimization, where unused field declarations are automatically removed in order to reduce document size and cognitive load.
It looks like your LogisticRegression
model object only uses two input fields.
You could try replacing LogisticRegression
with some other model type that uses all input fields. Some "greedy" algorithm such as a random forest might be a good fit.
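To illustrate why a tree ensemble tends to keep all input fields alive, here is a toy sketch with made-up data and shapes; in the real setup the classifier would sit inside the PMMLPipeline after the DataFrameMapper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real training frame: four continuous fields,
# of which only the first two actually drive the label.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A "greedy" ensemble splits on many features, so every input field tends
# to stay referenced by the model (and is therefore kept in the PMML).
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

# Every field with a non-zero importance is referenced somewhere in the forest.
used_fields = int((clf.feature_importances_ > 0).sum())
print(f"{used_fields} of 4 input fields are referenced by the forest")
```

A linear model with few non-zero coefficients keeps only the fields it uses, which is why the LogisticRegression variant produced a PMML with just two inputs.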
TL;DR: You cannot separate domain decorators such as ContinuousDomain
from the actual model object. If you want to reuse pre-fitted domain decorators in multiple pipelines, you should store them as pure Python objects in the Pickle or Dill data formats. Alternatively, write a utility function (and package it as a Python library) that you can import and call conveniently.
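A minimal sketch of the pickle route, using a plain dict of training statistics as a stand-in (any fitted Python object, including the DataFrameMapper from the question, pickles the same way):

```python
import pickle

# Stand-in for the fitted transformation state; in practice you would
# dump the fitted DataFrameMapper (or the ContinuousDomain objects).
train_stats = {"var_a": [0.5, 2.0, 9.5], "var_b": [1.0, 3.0, 7.0]}

with open("train_stats.pkl", "wb") as f:
    pickle.dump(train_stats, f)

# Later, in another pipeline-building script:
with open("train_stats.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == train_stats
```

Each new pipeline then loads the pickled decorators, places them in its own DataFrameMapper step ahead of the model object, and exports one complete PMML per pipeline.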