I have a PMMLPipeline with the following DataFrameMapper inside (Domains are coming from sklearn2pmml, while the Mapper is from sklearn-pandas):
{'features': [(['A'],
[ContinuousDomain(dtype=<class 'float'>, high_value=82274.69794050456,
low_value=1617.9990694391965,
missing_value_replacement=26693.17348049894,
missing_value_treatment='as_value', missing_values=[nan, -1],
outlier_treatment='as_extreme_values')]),
(['B'],
[ContinuousDomain(dtype=<class 'int'>, high_value=142, low_value=0,
missing_value_replacement=71,
missing_value_treatment='as_value', missing_values=[nan, -1],
outlier_treatment='as_extreme_values')]),
(['C'],
[ContinuousDomain(dtype=<class 'int'>, high_value=34, low_value=0,
missing_value_replacement=17,
missing_value_treatment='as_value', missing_values=[nan, -1],
outlier_treatment='as_extreme_values')]),
(['D'],
[ContinuousDomain(dtype=<class 'int'>, high_value=903, low_value=2,
missing_value_replacement=448,
missing_value_treatment='as_value', missing_values=[nan, -1],
outlier_treatment='as_extreme_values')]),
(['E'],
[ContinuousDomain(dtype=<class 'int'>, high_value=95, low_value=0,
missing_value_replacement=48,
missing_value_treatment='as_value', missing_values=[nan, -1],
outlier_treatment='as_extreme_values')])],
'default': False,
'built_default': False,
'sparse': False,
'df_out': False,
'input_df': False,
'drop_cols': [],
'transformed_names_': ['A',
'B',
'C',
'D',
'E'],
'built_features': [(['A'],
TransformerPipeline(steps=[('continuousdomain',
ContinuousDomain(dtype=<class 'float'>,
high_value=82274.69794050456,
low_value=1617.9990694391965,
missing_value_replacement=26693.17348049894,
missing_value_treatment='as_value',
missing_values=[nan, -1],
outlier_treatment='as_extreme_values'))]),
{}),
(['B'],
TransformerPipeline(steps=[('continuousdomain',
ContinuousDomain(dtype=<class 'int'>,
high_value=142, low_value=0,
missing_value_replacement=71,
missing_value_treatment='as_value',
missing_values=[nan, -1],
outlier_treatment='as_extreme_values'))]),
{}),
(['C'],
TransformerPipeline(steps=[('continuousdomain',
ContinuousDomain(dtype=<class 'int'>, high_value=34,
low_value=0,
missing_value_replacement=17,
missing_value_treatment='as_value',
missing_values=[nan, -1],
outlier_treatment='as_extreme_values'))]),
{}),
(['D'],
TransformerPipeline(steps=[('continuousdomain',
ContinuousDomain(dtype=<class 'int'>,
high_value=903, low_value=2,
missing_value_replacement=448,
missing_value_treatment='as_value',
missing_values=[nan, -1],
outlier_treatment='as_extreme_values'))]),
{}),
(['E'],
TransformerPipeline(steps=[('continuousdomain',
ContinuousDomain(dtype=<class 'int'>, high_value=95,
low_value=0,
missing_value_replacement=48,
missing_value_treatment='as_value',
missing_values=[nan, -1],
outlier_treatment='as_extreme_values'))]),
{})]}
Now I want to transform the following one-row pandas DataFrame:
import pandas as pd

test_inst = {
    'A': [53.51370],
    'B': [28],
    'C': [7],
    'D': [655],
    'E': [81]
}
test_pd = pd.DataFrame.from_dict(test_inst)
# wrap is the object holding the DataFrameMapper shown above
wrap.transform(test_pd)
But for A = 53.51370 I get
ValueError: ['A']: Data contains 1 invalid values
while for A = 53.51371 it works as expected.
I am really not sure why it behaves like this, because both values are outside of [low, high] and should be treated as outliers anyway.
Would really appreciate any kind of hint to the root cause of the problem.
Thanks a lot in advance and BR Paul
First of all, you should present the Python code that constructs your DataFrameMapper object, not its print-out.
Did you fit the DataFrameMapper object before using it for transformations? Right now you're calling transform(X), but the earlier call to fit(X) is nowhere to be seen.
Calling fit(X) is important, because SkLearn2PMML domain decorator classes also learn from data. In the current case, they would be learning the "natural bounds" of the valid value space.
Anyway, with SkLearn2PMML version 0.101.0, the transformation for the "A" column appears to work just fine. The result is 1617.99906944, which corresponds to the lower bound of the outlier treatment.
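To see why the lower bound is the expected result, here is a minimal pure-Python sketch of what an "as_extreme_values" outlier treatment amounts to: values below the low bound are raised to it, and values above the high bound are lowered to it. The helper name clamp_to_extremes is made up for illustration and is not part of SkLearn2PMML; the bounds are copied from the print-out in the question.

```python
def clamp_to_extremes(value, low_value, high_value):
    """Illustrative stand-in for an 'as_extreme_values' outlier treatment.

    This is NOT the SkLearn2PMML implementation, only a sketch of the
    clamping rule: out-of-range values are replaced by the nearest bound.
    """
    return max(low_value, min(value, high_value))


# Bounds of column "A", taken from the ContinuousDomain print-out
low_value = 1617.9990694391965
high_value = 82274.69794050456

# Both inputs from the question lie below the lower bound,
# so both are expected to be replaced by low_value.
results = {a: clamp_to_extremes(a, low_value, high_value)
           for a in (53.51370, 53.51371)}

for a, clamped in results.items():
    print(a, "->", clamped)
```

Under this rule both 53.51370 and 53.51371 map to the same value, so a difference in behavior between them would have to come from some other step (for example, validity checking), not from the outlier treatment itself.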