python-3.x scikit-learn scikit-learn-pipeline

Unable to use lambda in scikit-learn Pipeline


I have a pipeline which uses lambda functions:

import statsmodels.api as sm
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline

# The transformers, column lists, drop_cols, feature_remover and customOLS are defined elsewhere.
preprocess_ppl = ColumnTransformer(
    transformers=[
        ('encode', categorical_transformer, make_column_selector(dtype_include=object)),
        ('zero_impute', fill_na_zero_transformer, lambda X: [col for col in fill_zero_cols if col in X.columns]),
        ('numeric', numeric_transformer, lambda X: [col for col in num_cols if col in X.columns])
    ]
)
pipeline2 = Pipeline(
    steps=[
        ('dropper', drop_cols),
        ('remover', feature_remover),
        ("preprocessor", preprocess_ppl),
        ("estimator", customOLS(sm.OLS))
    ]
)

Basically, the lambda functions select the columns only if they are present in X. Intermediate steps sometimes remove columns, so a column listed in num_cols may already be gone by the time the preprocessor runs. That is why I use lambda functions to select only the columns that are still present.

The problem is that the lambda functions are not serializable with pickle, and I have to use pickle; I cannot use dill. Is there any other way of writing these lambda functions?


Solution

  • Don't use a lambda; just pass the list fill_zero_cols for 'zero_impute' and the list num_cols for 'numeric'.

    After all, your lambda is just checking whether each column name is in X.columns before processing. But I'm sure that if you try to process an input with missing features, your model will break anyway.

    So, your only solution is to predefine a list for each column type you want to process. This will give you consistent results.

    You have to ensure that your functions fill the null values. Be extremely careful about how you fill gaps for data that doesn't exist; the way you fill them has to be conceptually valid.

    Ensure you pre-process your data (fill the nulls) either before you train the model or inside your pipeline, using a callable function.
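If the presence check is still wanted, one option that the answer above does not spell out is to replace each lambda with a small callable class, since ColumnTransformer accepts any callable as a column selector and instances of a module-level class can be pickled. The following is only a minimal sketch: the class name PresentColumns, the example column names, and the choice of SimpleImputer and StandardScaler are assumptions standing in for the question's own transformers and lists.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


class PresentColumns:
    """Hypothetical picklable replacement for the lambda selectors.

    Returns the wanted columns that actually exist in X.
    """

    def __init__(self, wanted):
        self.wanted = list(wanted)

    def __call__(self, X):
        return [col for col in self.wanted if col in X.columns]


# Assumed stand-ins for the question's fill_zero_cols and num_cols.
fill_zero_cols = ["a", "missing_col"]
num_cols = ["b"]

preprocess_ppl = ColumnTransformer(
    transformers=[
        ("zero_impute", SimpleImputer(strategy="constant", fill_value=0),
         PresentColumns(fill_zero_cols)),
        ("numeric", StandardScaler(), PresentColumns(num_cols)),
    ]
)

X = pd.DataFrame({"a": [1.0, None, 3.0], "b": [10.0, 20.0, 30.0]})
preprocess_ppl.fit(X)

# Round-trip through the standard pickle module; no dill required.
restored = pickle.loads(pickle.dumps(preprocess_ppl))
result = restored.transform(X)
```

Note that pickle stores only a reference to the class, so PresentColumns has to be importable (for example, defined in a module of your project) in the process that loads the pipeline. The selector is evaluated at fit time, so the answer's caveat still applies: data passed to transform must contain the columns that were selected during fit.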