python scikit-learn scikit-learn-pipeline

Pass input from one step to another in a ColumnTransformer scikit-learn pipeline


I have a pipeline that looks like this:

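from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_transformer = Pipeline(steps=[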
    ('categorical_imputer', SimpleImputer(strategy="constant", fill_value='Unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

fill_na_zero_transformer = Pipeline(steps=[
    ('zero_imputer', SimpleImputer(strategy='constant', fill_value=0))
])

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="constant", fill_value=-1, add_indicator=True)),
    ('scaler', StandardScaler())
])

preprocess_ppl = ColumnTransformer(
    transformers=[
        ('categorical', categorical_transformer, ['MARITAL_STATUS']),
        ('zero_impute', fill_na_zero_transformer, fill_zero_cols),
        ('numeric', numeric_transformer, num_cols)
    ]
)

pipeline = Pipeline(
    steps=[
        ('dropper', drop_cols),
        ('remover',feature_remover),
        ("preprocessor", preprocess_ppl),
        ("estimator", LinearRegression())]
)

The dropper step drops some columns, and the remover step (feature_remover) drops more based on some logic. In ('numeric', numeric_transformer, num_cols), instead of the fixed num_cols list I want to pass only the columns that are still present in the transformed data reaching the 'numeric' transformer.

That is, if the intermediate data just before the 'numeric' step is X, I want to pass

[col for col in num_cols if col in X.columns]

instead of num_cols.

Is this possible?


Solution

  • Yes. According to the ColumnTransformer docs, the columns entry of each transformers triple accepts:

    columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable

    [...] A callable is passed the input data X and can return any of the above. [...]

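    So, at its simplest, passing lambda X: [col for col in num_cols if col in X.columns] in place of num_cols should work, provided the earlier steps hand the ColumnTransformer a DataFrame (so that X.columns exists).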
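
To make the callable approach concrete, here is a minimal sketch under some assumptions: the data, the column names a, b and c, and the drop_c helper are hypothetical stand-ins for the question's dropper and feature_remover steps, and the selector is written as a named function (present_num_cols) rather than a lambda. It is not the asker's actual pipeline, just an illustration that the callable sees the data as it arrives at the ColumnTransformer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical column list; "c" is removed before the ColumnTransformer runs.
num_cols = ["a", "b", "c"]


def present_num_cols(X):
    """Return the entries of num_cols that still exist in the DataFrame X."""
    return [col for col in num_cols if col in X.columns]


def drop_c(X):
    """Hypothetical stand-in for the question's dropper/feature_remover steps."""
    return X.drop(columns=["c"])


numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=-1, add_indicator=True)),
    ("scaler", StandardScaler()),
])

preprocess_ppl = ColumnTransformer(transformers=[
    # A callable in the columns slot is called with the incoming X during fit.
    ("numeric", numeric_transformer, present_num_cols),
])

pipeline = Pipeline(steps=[
    ("dropper", FunctionTransformer(drop_c)),  # returns a DataFrame, so X.columns exists
    ("preprocessor", preprocess_ppl),
    ("estimator", LinearRegression()),
])

# Small made-up dataset with one missing value in column "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [0.5, 1.5, 2.5, 3.5],
    "c": [9.0, 9.0, 9.0, 9.0],
})
y = [1.0, 2.0, 3.0, 4.0]

pipeline.fit(df, y)

# Output of everything before the estimator: scaled "a", scaled "b",
# plus the missing-value indicator that add_indicator=True creates for "a".
transformed = pipeline[:-1].transform(df)
```

A named function is used here only because lambdas cannot be pickled, which matters if the fitted pipeline is saved later; the lambda from the answer behaves the same way during fit and transform.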