Tags: python, scikit-learn, pipeline, gridsearchcv, feature-engineering

How do I add external features to my pipeline?


There is a similar question asked here on SO many years back, but it was never answered. I have the same question: I would like to add new columns of data, in my case 3 dummy-variable columns, to a sparse matrix (from TfidfVectorizer) before building a Pipeline and running a GridSearch to find the best hyperparameters.

Currently, I can do this model by model, without GridSearch or a Pipeline, using the code below.

# this is an NLP project
X = df["text"] # column of text
y = df["target"] # continuous target variable
X_train, X_unseen, y_train, y_unseen = train_test_split(X, y, test_size=0.5, stratify=df["platform"], random_state=42)

# vectorize
tvec = TfidfVectorizer(stop_words="english")
X_train_tvec = tvec.fit_transform(X_train)

# get dummies (keep only the rows belonging to the training split,
# so the row counts match when stacking)
dummies = pd.get_dummies(df["dummies"]).loc[X_train.index].values
# add dummies to the tvec sparse matrix
X_train_tvec_dumm = hstack([X_train_tvec, dummies]).toarray()

From here, I can fit my model on the X_train_tvec_dumm training data, which combines the sparse matrix of word vectors from TfidfVectorizer (shape: (n_rows, n_columns)) with the 3 dummy columns. The final shape is therefore (n_rows, n_columns + 3).
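For reference, the per-model approach hinges on the dummy block being row-aligned with the vectorized text. A minimal sketch with made-up data (the column names and values here are hypothetical, not from the original question):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical stand-in for df["text"] / df["dummies"]
df = pd.DataFrame({
    "text": ["red apples", "green pears", "red pears", "green apples"],
    "dummies": ["a", "b", "a", "c"],
})

tvec = TfidfVectorizer()
X_tvec = tvec.fit_transform(df["text"])          # sparse, shape (4, n_terms)
dummies = pd.get_dummies(df["dummies"]).values   # dense, shape (4, 3)

# both blocks must have the same number of rows before stacking
X_combined = hstack([X_tvec, dummies])
print(X_combined.shape)  # (4, n_terms + 3)
```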

I tried to build the Pipeline as follows.

# get dummies
dummies = pd.get_dummies(df["dummies"]).values

def add_dummies(matrix):
    return hstack([matrix, dummies]).toarray()


pipe = Pipeline([
    ("features", FeatureUnion([
        ("tvec", TfidfVectorizer(stop_words="english")),
        ("dummies", add_dummies(??))  # <-- how do I add this step into the pipeline?
    ])),
    ("ridge", RidgeCV())
])

pipe_params = {
    'features__tvec__max_features': [200, 500],
    'features__tvec__ngram_range': [(1,1), (1,2)]
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=4)
gs.fit(X_train, y_train)
print(gs.best_score_)

There is a tutorial that describes how to build a custom transformer for a Pipeline, but its custom function engineers a new feature by transforming X_train. My dummy variables, unfortunately, come from outside the X_train set.


Solution

  • Instead of get_dummies from pandas, use OneHotEncoder from sklearn.preprocessing together with ColumnTransformer from sklearn.compose. Build a DataFrame with both the 'text' column and the categorical ('dummies') column as features.

    Note: in older scikit-learn versions (before 0.20), OneHotEncoder only accepted integer features, so string categories had to be mapped to ints first; recent versions handle string categories directly.

    # ...
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    
    
    text_features = 'text'
    text_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))])
    
    categorical_features = ['category']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('text', text_transformer, text_features),
            ('cat', categorical_transformer, categorical_features)])
    
    
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('ridge', RidgeCV())])
    
    # ...
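Putting it together, here is a self-contained sketch using a toy DataFrame (the column names 'text', 'category', and 'target' and all values are made up for illustration). It also shows that the TfidfVectorizer hyperparameters stay tunable through GridSearchCV via the nested `step__substep__param` names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy data in place of the real df
df = pd.DataFrame({
    "text": ["cheap red apples", "ripe green pears", "sour red pears",
             "sweet green apples", "cheap ripe pears", "sour sweet apples"] * 4,
    "category": ["a", "b", "a", "c", "b", "c"] * 4,
    "target": [1.0, 2.0, 1.5, 3.0, 2.5, 2.0] * 4,
})

preprocessor = ColumnTransformer(transformers=[
    # a plain string selector passes a 1-D column, which TfidfVectorizer needs
    ("text", TfidfVectorizer(stop_words="english"), "text"),
    # a list selector passes a 2-D frame, which OneHotEncoder needs
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

pipe = Pipeline(steps=[("preprocessor", preprocessor),
                       ("ridge", RidgeCV())])

# parameters are addressed through the nested step names
pipe_params = {
    "preprocessor__text__max_features": [5, 10],
    "preprocessor__text__ngram_range": [(1, 1), (1, 2)],
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=2)
gs.fit(df[["text", "category"]], df["target"])
print(gs.best_params_)
```

Note that because the text transformer here is the TfidfVectorizer itself (not wrapped in an inner Pipeline as in the snippet above), the parameter path is `preprocessor__text__max_features`; with the inner Pipeline it would be `preprocessor__text__vectorizer__max_features`.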